GitSync primitive: bind an arbitrary site-owned directory to a git remote (pull/push/scheduled-sync) #38

@chubes4

Summary

Add a GitSync primitive to data-machine-code that binds an arbitrary site-owned local directory to a remote git repository, with pull/push semantics, per-binding policies, and scheduled sync support.

This is the layer above the existing Workspace abilities. Workspace operates on agent-owned checkouts under ~/.datamachine/workspace/<repo> for ad-hoc code edits. GitSync operates on site-owned subtrees (e.g. wp-content/uploads/markdown/wiki/, wp-content/uploads/datamachine-files/agents/<slug>/) that plugins want to keep in lockstep with a remote over time.

Intelligence is the first consumer — it needs to git-sync wiki content subtrees (Automattic/intelligence#31) and wiki-generator agent definitions (Automattic/intelligence#125). Both are the same underlying pattern: bind a local path to a remote, pull periodically, optionally push local changes back. The primitive belongs here, not duplicated in Intelligence.

Why this layering is right

Intelligence is the consumer. Data Machine Code owns the git substrate:

  • Workspace/ classes already wrap every git operation we'd need (clone_repo, git_pull, git_push, git_add, git_commit, git_status, git_diff, git_log)
  • WorkspaceAbilities already exposes these as abilities — permission-gated, callable from CLI/REST/MCP/chat
  • GitHubAbilities handles GitHub API (PR creation, issue wiring) at the same boundary
  • datamachine_workspace_git_policies already has the policy shape we need: per-repo write_enabled, push_enabled, allowed_paths, fixed_branch

The only missing concept is "bind an external path to a remote". Workspace is opinionated about path structure — it manages checkouts under ~/.datamachine/workspace/<repo>. A GitSync binding points at a site path (wp-content/uploads/... or anywhere else the site controls) and keeps that path as a git working tree against a declared remote.

Having Intelligence reimplement this would duplicate Workspace's git plumbing and policy substrate. Moving it into DMC means:

  • One place owns "the site has git-synced directories" semantics
  • Any plugin (Intelligence, a WooCommerce extension, a docs plugin) gets the primitive for free
  • Scheduled sync slots naturally into the DM flow/cron infrastructure that DMC is already adjacent to
  • Policy posture stays consistent with workspace: containment, sensitive-file blocking, explicit push enablement

Proposed shape

New module

data-machine-code/inc/GitSync/
├── GitSync.php              bind/unbind/pull/push/status against bindings
├── GitSyncBinding.php       value object: local_path, remote_url, branch, policy
├── GitSyncRegistry.php      stores bindings in datamachine_gitsync_bindings option
└── GitSyncSecurity.php      containment checks, sensitive-file filter (reuses
                              Workspace policy substrate where possible)

data-machine-code/inc/Abilities/
└── GitSyncAbilities.php     datamachine/gitsync-bind
                              datamachine/gitsync-unbind
                              datamachine/gitsync-pull
                              datamachine/gitsync-push
                              datamachine/gitsync-status
                              datamachine/gitsync-list
                              datamachine/gitsync-policy-update

data-machine-code/inc/Tasks/
└── GitSyncPullTask.php      scheduled pull task — runs hourly (configurable),
                              iterates bindings with auto_pull=true,
                              fires per-binding pull via the ability

Binding shape

array(
    'slug'         => 'intelligence-wiki',            // unique binding ID
    'local_path'   => '/uploads/markdown/wiki/',      // relative to ABSPATH
    'remote_url'   => 'https://github.com/Automattic/a8c-wiki-woocommerce',
    'branch'       => 'main',
    'policy'       => array(
        'auto_pull'     => true,                      // scheduled sync opt-in
        'pull_interval' => 'hourly',                  // uses DM scheduler intervals
        'write_enabled' => false,                     // can local commits happen?
        'push_enabled'  => false,                     // can push upstream?
        'allowed_paths' => array(),                   // write containment
        'conflict'      => 'upstream_wins',           // or 'fail' or 'manual'
    ),
    'created_at'   => '2026-04-20T12:00:00Z',
    'last_pulled'  => '2026-04-20T12:15:00Z',
    'last_commit'  => 'abc123...',
)

Stored in datamachine_gitsync_bindings option. Separate from datamachine_workspace_git_policies — different concerns (site subtrees vs agent workspace repos), so don't cram both into one option.
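A minimal sketch of how a binding could be normalized before it is stored, assuming the array shape above. The function names gitsync_default_policy() and gitsync_normalize_binding() are hypothetical, and the defaults for auto_pull, write_enabled, and branch are assumptions; only push_enabled=false and conflict=fail are stated as defaults in this issue.

```php
<?php
// Hypothetical helpers: the issue specifies the binding shape, not this code.

/** Conservative policy defaults: nothing syncs or writes unless opted into. */
function gitsync_default_policy(): array {
    return array(
        'auto_pull'     => false,    // assumption: scheduled sync is opt-in
        'pull_interval' => 'hourly',
        'write_enabled' => false,    // assumption: mirrors push_enabled
        'push_enabled'  => false,    // stated default in the issue
        'allowed_paths' => array(),
        'conflict'      => 'fail',   // stated default in the issue
    );
}

/** Validate required fields and merge caller policy over the defaults. */
function gitsync_normalize_binding( array $input ): array {
    foreach ( array( 'slug', 'local_path', 'remote_url' ) as $key ) {
        if ( empty( $input[ $key ] ) || ! is_string( $input[ $key ] ) ) {
            throw new InvalidArgumentException( "Missing required binding field: {$key}" );
        }
    }

    $policy = array_merge( gitsync_default_policy(), $input['policy'] ?? array() );
    if ( ! in_array( $policy['conflict'], array( 'upstream_wins', 'fail', 'manual' ), true ) ) {
        throw new InvalidArgumentException( 'Unknown conflict policy.' );
    }

    return array(
        'slug'        => $input['slug'],
        'local_path'  => $input['local_path'],
        'remote_url'  => $input['remote_url'],
        'branch'      => $input['branch'] ?? 'main', // assumption: default branch
        'policy'      => $policy,
        'created_at'  => $input['created_at'] ?? gmdate( 'Y-m-d\TH:i:s\Z' ),
        'last_pulled' => $input['last_pulled'] ?? null,
        'last_commit' => $input['last_commit'] ?? null,
    );
}
```

Merging over a fixed default array means a consumer that passes only array( 'auto_pull' => true ) still gets push disabled and conflict=fail.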

CLI surface

# Bind a local directory to a remote
wp datamachine-code gitsync bind intelligence-wiki \
  --local=/uploads/markdown/wiki/ \
  --remote=https://github.com/Automattic/a8c-wiki-woocommerce \
  --branch=main \
  --auto-pull=hourly

# Pull / push / status
wp datamachine-code gitsync pull intelligence-wiki
wp datamachine-code gitsync push intelligence-wiki --commit-message="..."
wp datamachine-code gitsync status intelligence-wiki

# List all bindings
wp datamachine-code gitsync list

# Update policy
wp datamachine-code gitsync policy intelligence-wiki --push-enabled=true

# Remove (does NOT delete the local directory — just the binding metadata)
wp datamachine-code gitsync unbind intelligence-wiki

Programmatic API for consumers

Intelligence (and other plugins) should not call git operations directly. They should call the GitSync ability or service:

use DataMachineCode\GitSync\GitSync;

$sync = new GitSync();

// Bind + initial clone
$sync->bind( array(
    'slug'       => 'intelligence-wiki',
    'local_path' => '/uploads/markdown/wiki/',
    'remote_url' => 'https://github.com/...',
    'branch'     => 'main',
    'policy'     => array( 'auto_pull' => true, 'pull_interval' => 'hourly' ),
) );

// Programmatic pull (e.g. from a consumer flow)
$sync->pull( 'intelligence-wiki' );

Or via the ability surface (MCP/REST/CLI callable):

$ability = wp_get_ability( 'datamachine/gitsync-pull' );
if ( $ability ) { // null when the ability is not registered
    $result = $ability->execute( array( 'slug' => 'intelligence-wiki' ) );
}

Scheduled sync

A single DM scheduled task (datamachine_gitsync_tick, hourly) iterates every binding with auto_pull=true, dispatches a per-binding pull via GitSync::pull(), and logs the results. It reuses DM's Action Scheduler group and respects pull_interval: a binding with pull_interval=daily is skipped until 24 hours have passed since last_pulled.
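The skip rule for the hourly tick can be sketched as a pure "is this binding due?" check. The function names and the interval-to-seconds map are assumptions for illustration; the actual DM scheduler interval names are not given in this issue beyond hourly and daily.

```php
<?php
// Hypothetical helpers sketching the pull_interval check on each tick.

/** Map an interval name to seconds. Only the two names used in this issue. */
function gitsync_interval_seconds( string $interval ): int {
    $map = array(
        'hourly' => 3600,
        'daily'  => 86400,
    );
    if ( ! isset( $map[ $interval ] ) ) {
        throw new InvalidArgumentException( "Unknown pull_interval: {$interval}" );
    }
    return $map[ $interval ];
}

/** Decide whether the tick should pull this binding at time $now (Unix seconds). */
function gitsync_is_due( array $binding, int $now ): bool {
    $policy = $binding['policy'] ?? array();
    if ( empty( $policy['auto_pull'] ) ) {
        return false; // scheduled sync is opt-in
    }
    if ( empty( $binding['last_pulled'] ) ) {
        return true; // never pulled yet
    }
    $last = strtotime( $binding['last_pulled'] );
    if ( false === $last ) {
        return true; // assumption: an unparseable timestamp counts as due
    }
    $interval = gitsync_interval_seconds( $policy['pull_interval'] ?? 'hourly' );
    return ( $now - $last ) >= $interval;
}
```

Passing $now in, rather than calling time() inside, keeps the check deterministic and easy to test.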

Security posture

Same constraints Workspace already enforces, generalized:

  • Path containment: local_path must be under ABSPATH (or an explicitly allowed root); realpath() + traversal check (.., .) rejects escapes.
  • Sensitive-file blocking: .env, credentials, keys, private SSH, anything matching existing Workspace block list.
  • Push enablement: defaults to false. push_enabled=true requires explicit admin action.
  • Branch pinning: branch is enforced. A pull fails if it would move the working tree onto any branch other than the pinned one.
  • Auth: remote credentials come from existing DM auth providers where applicable (GitHub handler already has this). HTTPS with token is the baseline; SSH keys out of scope for v1.
  • Conflict policy: upstream_wins (discard local changes and force-reset to the remote branch), fail (abort the pull when the working tree has local changes), manual (pull, log the conflict, and leave resolution to an admin). The default is fail, so nothing destructive happens without an explicit opt-in.
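The containment and sensitive-file checks above might look roughly like the following. Both function names are hypothetical, and the pattern list is an illustrative stand-in, since the actual Workspace block list is not reproduced in this issue.

```php
<?php
// Hypothetical helpers sketching the containment and sensitive-file rules.

/**
 * True only if $candidate resolves to a directory strictly inside $root.
 * realpath() collapses "..", "." and symlinks before the prefix comparison.
 */
function gitsync_is_contained( string $root, string $candidate ): bool {
    $real_root = realpath( $root );
    $real_path = realpath( $candidate );
    if ( false === $real_root || false === $real_path ) {
        return false; // realpath() fails for paths that do not exist
    }
    $prefix = rtrim( $real_root, DIRECTORY_SEPARATOR ) . DIRECTORY_SEPARATOR;
    $target = $real_path . DIRECTORY_SEPARATOR;
    // Reject the root itself; require the target to start with "<root>/".
    return $target !== $prefix && 0 === strncmp( $target, $prefix, strlen( $prefix ) );
}

/** True if the file name matches an illustrative (assumed) block list. */
function gitsync_is_sensitive( string $relative_path ): bool {
    $patterns = array( '.env', '.env.*', '*.pem', '*.key', 'id_rsa', 'id_ed25519', 'credentials*' );
    $name     = basename( $relative_path );
    foreach ( $patterns as $pattern ) {
        if ( fnmatch( $pattern, $name ) ) {
            return true;
        }
    }
    return false;
}
```

Because realpath() returns false for missing paths, a bind that performs the initial clone would need to create the target directory (or check its nearest existing parent) before running this check.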

What it explicitly is not

  • Not the Workspace system. Workspace keeps its home under ~/.datamachine/workspace/ and is for agent-owned code editing checkouts (primary + worktrees). GitSync is for site-owned subtrees that a plugin wants mirrored. They share the underlying Workspace class's git methods but are configured and scoped differently.
  • Not a generic git hosting / registry. It's a sync primitive. PR review happens on GitHub; conflict resolution for tricky cases happens by admin intervention; webhook-driven real-time sync is out of scope for v1 (hourly poll is enough).
  • Not Intelligence-specific. Nothing about the primitive cares what the synced content is. Intelligence is the first consumer; others follow.

Consumers (filed or expected)

  • Automattic/intelligence#31 — Git-synced wiki content subtrees. Maps cleanly to a GitSync binding per wiki subtree (woocommerce-wiki, jetpack-wiki, etc.).
  • Automattic/intelligence#125 — Git-tracked wiki-generator agent definitions. Maps to a GitSync binding on uploads/datamachine-files/agents/wiki-generator/ pointing at github.com/Automattic/a8c-wiki-generator.
  • Both Intelligence issues will be updated to consume this primitive rather than invent their own sync code.

Relationship to existing DMC pieces

Existing | Reused by GitSync | Notes
Workspace\Workspace class | Yes (underlying git operations) | clone_repo, git_pull, git_push, git_add, git_commit are generic enough to target any path. May need small refactor to accept target directory as param instead of resolving from workspace root.
WorkspaceAbilities | No (parallel abilities class) | Keep separate for clear mental model: workspace = agent-owned code, gitsync = site-owned subtree.
datamachine_workspace_git_policies option | No (separate option) | Different concerns; keep storage separate.
GitHubAbilities | Optional consumer | Push-to-PR flow (v2) could use these for opening PRs from local changes.
DM scheduler | Yes | Scheduled pull task uses Action Scheduler via datamachine_gitsync_tick.
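One way the "accept target directory as param" refactor could work is to pass the directory to git with its -C option instead of resolving a workspace root. This is a sketch only: the function names are hypothetical, the remote name origin is an assumption, and the existing Workspace methods may build commands differently.

```php
<?php
// Hypothetical helpers: build git commands against an explicit directory.

/** Build a shell-safe "git -C <dir> <args...>" command string. */
function gitsync_git_command( string $working_dir, array $args ): string {
    $parts = array( 'git', '-C', escapeshellarg( $working_dir ) );
    foreach ( $args as $arg ) {
        $parts[] = escapeshellarg( (string) $arg );
    }
    return implode( ' ', $parts );
}

/** Fast-forward-only pull of the pinned branch (assumes a remote named origin). */
function gitsync_pull_command( string $working_dir, string $branch ): string {
    return gitsync_git_command( $working_dir, array( 'pull', '--ff-only', 'origin', $branch ) );
}

/** Command that prints the currently checked-out branch, for branch pinning. */
function gitsync_current_branch_command( string $working_dir ): string {
    return gitsync_git_command( $working_dir, array( 'rev-parse', '--abbrev-ref', 'HEAD' ) );
}
```

Quoting every argument with escapeshellarg() keeps branch names and paths from being interpreted by the shell, which matters once bindings are created from user-supplied input.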

Acceptance (v1)

  • GitSync service class with bind, unbind, pull, push, status, list methods
  • 7 abilities registered under datamachine-code category (bind, unbind, pull, push, status, list, policy-update)
  • wp datamachine-code gitsync CLI surface
  • Binding storage in datamachine_gitsync_bindings option
  • Hourly GitSyncPullTask scheduled task honoring auto_pull + pull_interval
  • Path containment + sensitive-file blocking mirrored from Workspace
  • Conflict policy defaulting to fail
  • Documented in README.md + a new docs/gitsync.md

Acceptance (v2, follow-up)

  • Push-to-PR flow using GitHubAbilities to open PRs instead of direct push
  • SSH key support as alternative to HTTPS tokens
  • Webhook-driven immediate pull (replaces/augments scheduled pull)
  • Partial sync / sparse checkout for large remotes

Open questions

  1. Option storage format. A single datamachine_gitsync_bindings array option, or a custom table? An option is fine for v1; move to a table if the count grows past ~50 bindings.
  2. Workspace refactor scope. How much of Workspace\Workspace can be reused as-is? Most methods are likely generic enough, with a minor refactor to accept a working-directory parameter.
  3. Auth reuse. Does GitHubAuthProvider in DM cover enough, or do bindings need their own credentials (e.g. pulling from a GitLab mirror)? Start with DM's GitHub auth; generalize later.
  4. Unbind semantics. Should unbind leave the .git/ directory in place (so the local path remains a valid working tree) or remove it (clean site directory)? Proposal: make it a flag, --keep-git, defaulting to true.
  5. Bidirectional sync for mutable consumers. Intelligence wiki content can be both pulled from upstream AND edited locally. Conflict policy for that case needs real thought — probably manual with explicit admin resolution.
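For reference when weighing question 5, the three conflict policies described under Security posture reduce to a small decision function. This is a sketch under assumptions: the function name and the returned action labels are hypothetical, and a real implementation would map each label to git operations.

```php
<?php
// Hypothetical helper: choose a pull strategy from policy and tree state.

/**
 * @param string $conflict_policy   'upstream_wins', 'fail', or 'manual'.
 * @param bool   $has_local_changes Whether the working tree differs from HEAD.
 * @return string An action label (names invented for this sketch).
 */
function gitsync_plan_pull( string $conflict_policy, bool $has_local_changes ): string {
    if ( ! $has_local_changes ) {
        return 'fast_forward'; // clean tree: every policy behaves the same
    }
    switch ( $conflict_policy ) {
        case 'upstream_wins':
            return 'reset_to_upstream'; // discard local changes, force-reset
        case 'manual':
            return 'pull_and_report';   // pull, log the conflict for an admin
        case 'fail':
            return 'abort';             // refuse to pull over local changes
        default:
            throw new InvalidArgumentException( "Unknown conflict policy: {$conflict_policy}" );
    }
}
```

The policies only diverge when the tree is dirty, which is exactly the bidirectional case: a consumer that edits locally will hit the dirty branch on most pulls, so its choice of policy decides nearly every sync.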
