Cloudstic · rmanibus · Mar 7, 2026
diff --git a/docs/affinity-benchmark.md b/docs/affinity-benchmark.md
@@ -0,0 +1,123 @@
+# Affinity Model Benchmark Results
+
+Benchmark script: `scripts/benchmark/affinity.sh`
+Go unit test: `internal/hamt/affinity_bench_test.go`
+
+---
+
+## E2E benchmark — `scripts/benchmark/affinity.sh`
+
+Two scenarios with **50 modified files each**, applied to the same initial tree
+(50 dirs × 50 files = 2 500 files), run as a second backup after the initial full backup.
+
+| Scenario | Change pattern | New HAMT node objects (affinity) | New HAMT node objects (legacy) |
+|---|---|---|---|
+| A — clustered | All 50 changes in `dir_01` | **18** | 75 |
+| B — scattered | 1 change in each of 50 dirs | 171 | 73 |
+
+The metric is **new `node/*` objects** written to the local store during the second
+backup. `KeyCacheStore` deduplicates writes: nodes whose content did not change are
+skipped, so only genuinely new HAMT path nodes reach the underlying store.
+
+> **Note:** The `Flushing HAMT: X reachable nodes` progress line shows **staging size**
+> (the full final tree), not delta writes. Do not use it to judge incremental cost.
+> The benchmark counts `node/*` entries in `index/packs` before/after the second backup.
+
+### Cross-binary comparison
+
+```
+# Affinity binary (RFC 0002)
+Scenario A — clustered (50 files in 1 dir):   18 new node objects
+Scenario B — scattered (1 file in 50 dirs):  171 new node objects
+Node-write reduction (A vs B): 89.5%  (153 fewer writes)
+
+# Legacy binary (pre-RFC 0002)
+Scenario A — clustered (50 files in 1 dir):   75 new node objects
+Scenario B — scattered (1 file in 50 dirs):   73 new node objects
+Difference: ~-2.7%  (no meaningful locality benefit)
+```
+
+Run the comparison yourself:
+
+```bash
+./scripts/benchmark/affinity.sh                          # current build
+CLOUDSTIC_BIN=/path/to/old-cloudstic ./scripts/benchmark/affinity.sh
+```
+
+### Why clustered beats scattered (affinity model)
+
+With `AffinityKey(parentID, fileID) = SHA256(parentID)[:4] + SHA256(fileID)[4:]`,
+all 50 files in `dir_01` share the same routing at HAMT levels 0–2 (determined by
+`SHA256("dir_01")[:4]`). They diverge only at level 3. An incremental update rewrites:
+
+- 1 root + 3 internal path nodes (L0 → L1 → L2 → L3) shared across all 50 files
+- ~14 L3 leaf nodes (one per occupied bucket at the divergence level)
+- Total: **~18 new nodes**
+
+For scattered changes (1 file per directory), each file traverses a different path
+from root — 50 distinct root-to-leaf paths are dirtied. Because affinity keys cluster
+same-dir files into deeper sub-trees, those cross-dir paths are also longer, so
+scattered writes are higher with affinity (171) than with legacy keys (73). This is an
+expected trade-off: affinity optimises the common case (changes concentrated in a
+directory) at the cost of a markedly worse worst case (fully scattered changes, about 2.3× the node writes in this benchmark).
+
+### Why legacy shows no difference between A and B
+
+`SHA256(fileID)` distributes all keys uniformly across the HAMT regardless of which
+directory a file lives in. Clustered changes and scattered changes both dirty ~30 of 32
+L0 buckets. There is no shared path to exploit, so both scenarios produce roughly the
+same number of new node writes (~70–75).
+
+---
+
+## Go unit test — `TestAffinityNodeWriteReduction`
+
+Simulates an incremental backup: build a 1 000-file tree (10 dirs × 100 files),
+then update all 100 files in one directory. Only the changed files touch new HAMT paths.
+
+```bash
+go test ./internal/hamt/ -run TestAffinityNodeWriteReduction -v
+```
+
+Result:
+
+```
+Incremental update of 100 files in one directory (1000 total files, 10 dirs):
+  affinity keys :   20 node writes
+  legacy keys   :   68 node writes
+  reduction     : 70.6%  (48 fewer writes)
+```
+
+The legacy simulation uses `AffinityKey(fileID, fileID) = SHA256(fileID)`, identical
+to the old `computePathKey(fileID)` — no code changes needed.
+
+### Go benchmark — `BenchmarkIncrementalUpdate_*`
+
+```bash
+go test ./internal/hamt/ -run=^$ -bench=BenchmarkIncrementalUpdate -benchmem -benchtime=3s
+```
+
+| Strategy | ns/op | B/op | allocs/op |
+|---|---|---|---|
+| Affinity | 2 171 472 | 1 254 035 | 14 363 |
+| Legacy   | 3 812 540 | 1 968 168 | 15 992 |
+
+**~1.75× faster, ~36% less memory** for a 100-file incremental update in one directory.
+
+---
+
+## Summary
+
+| Metric | Legacy | Affinity | Delta |
+|---|---|---|---|
+| E2E: clustered 50-file update | 75 nodes | 18 nodes | **−76%** |
+| E2E: scattered 50-file update | 73 nodes | 171 nodes | +134% (expected trade-off) |
+| E2E: initial tree size (50×50) | 962 nodes | 906 nodes | −6% |
+| Unit test: 100-file update, 1 dir | 68 nodes | 20 nodes | **−71%** |
+| Go benchmark: time per op | 3 813 µs | 2 171 µs | **−43%** |
+| Go benchmark: memory per op | 1 968 KB | 1 254 KB | **−36%** |
+
+The affinity model's benefit is specifically for **incremental updates of multiple files
+in the same directory**, the dominant pattern in real workloads. The scattered case (one
+change in every directory at the same time) is a pathological pattern that
+affinity does not optimise for.
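
The key formula quoted in the benchmark doc above can be sketched in Go as follows. This is a minimal illustration, not the code in `internal/hamt`: the helper names `hexDigest` and `affinityKey` are hypothetical, and the sketch assumes the `[:4]` slice is taken over the hex-encoded digest (4 hex characters, 16 bits), which appears to match the doc's statement that same-directory files share HAMT levels 0–2 in a 32-way tree.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hexDigest returns the lowercase hex SHA-256 digest of s (64 characters).
func hexDigest(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// affinityKey is a hypothetical re-implementation of the quoted formula:
// the first 4 hex characters come from the parent, the rest from the file.
// Assumption: slicing operates on hex strings; the real code may differ.
func affinityKey(parentID, fileID string) string {
	return hexDigest(parentID)[:4] + hexDigest(fileID)[4:]
}

func main() {
	a := affinityKey("dir_01", "file_01")
	b := affinityKey("dir_01", "file_02")

	// Files in the same directory share the routing prefix.
	fmt.Println("same-dir prefix shared:", a[:4] == b[:4])

	// Using the fileID as its own parent reproduces the plain SHA-256 key,
	// which is how the doc describes the legacy simulation.
	fmt.Println("legacy equivalence:", affinityKey("x", "x") == hexDigest("x"))
}
```

Passing the same value for both arguments yields `SHA256(fileID)` unchanged, so a single function can stand in for both key schemes in a comparison test.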
diff --git a/internal/core/models.go b/internal/core/models.go
@@ -62,8 +62,9 @@ type HAMTNode struct {
 
 // LeafEntry represents an entry in a Leaf node
 type LeafEntry struct {
-	Key      string `json:"key"`      // FileID
-	FileMeta string `json:"filemeta"` // "filemeta/<sha256>"
+	Key      string `json:"key"`               // FileID
+	PathKey  string `json:"path_key,omitempty"` // AffinityKey routing key; falls back to SHA256(Key) if empty
+	FileMeta string `json:"filemeta"`           // "filemeta/<sha256>"
 }
 
 // SourceInfo describes the origin of a backup snapshot. It is stored as a
@@ -87,6 +88,7 @@ type Snapshot struct {
 	Tags        []string          `json:"tags,omitempty"`
 	ChangeToken string            `json:"change_token,omitempty"`
 	ExcludeHash string            `json:"exclude_hash,omitempty"`
+	HAMTVersion int               `json:"hamt_version,omitempty"` // 1 = legacy, 2 = affinity keys
 }
 
 // Index represents a pointer to the latest snapshot

diff --git a/internal/engine/backup.go b/internal/engine/backup.go
@@ -94,6 +94,7 @@ type BackupManager struct {
 	metaCacheMu  sync.RWMutex
 	metaCache    map[string]core.FileMeta
 	pendingMetas map[string][]byte // deferred filemeta PUTs (ref → JSON)
+	parentIndex  map[string]string // fileID → primary parent fileID (for AffinityKey lookups)
 	hmacKey      []byte
 }
 
@@ -122,6 +123,7 @@ func NewBackupManager(src source.Source, dest store.ObjectStore, reporter ui.Rep
 		newMetas:     make(map[string]core.FileMeta),
 		metaCache:    make(map[string]core.FileMeta),
 		pendingMetas: make(map[string][]byte),
+		parentIndex:  make(map[string]string),
 		hmacKey:      hmacKey,
 	}
 }
@@ -321,6 +323,7 @@ func (bm *BackupManager) saveSnapshot(ctx context.Context, root string, seq int,
 		Meta:        meta,
 		ChangeToken: changeToken,
 		ExcludeHash: bm.cfg.excludeHash,
+		HAMTVersion: 2,
 	}
 
 	hash, snapData, err := core.ComputeJSONHash(&snap)

diff --git a/internal/engine/backup_scan.go b/internal/engine/backup_scan.go
@@ -38,12 +38,25 @@ type scanState struct {
 	totalBytes int64
 }
 
+// primaryParentID returns the raw source-level parent identifier for a FileMeta.
+// This is the first element of meta.Parents, which contains raw source IDs (e.g. GDrive folder IDs).
+// Returns "" for root-level entries with no parents.
+func primaryParentID(meta *core.FileMeta) string {
+	if len(meta.Parents) > 0 {
+		return meta.Parents[0]
+	}
+	return ""
+}
+
 func (bm *BackupManager) processEntry(ctx context.Context, meta *core.FileMeta, oldRoot string, s *scanState, phase ui.Phase) error {
 	if meta.Type == core.FileTypeFolder {
 		meta.ContentHash = ""
 		meta.Size = 0
 	}
 
+	// Record this entry's parent so lookupMetaByFileID can use AffinityKey.
+	bm.parentIndex[meta.FileID] = primaryParentID(meta)
+
 	// Resolve Paths when the source hasn't populated it (incremental/changes
 	// sources only emit changed entries and can't build a full path map).
 	if len(meta.Paths) == 0 {
@@ -57,7 +70,7 @@ func (bm *BackupManager) processEntry(ctx context.Context, meta *core.FileMeta,
 
 	if !changed {
 		bm.recordStat(meta.Type, false, false)
-		s.root, err = bm.tree.Insert(s.root, meta.FileID, oldRef)
+		s.root, err = bm.tree.Insert(s.root, primaryParentID(meta), meta.FileID, oldRef)
 		if err != nil {
 			return fmt.Errorf("hamt insert: %w", err)
 		}
@@ -106,7 +119,7 @@ func (bm *BackupManager) scanIncremental(ctx context.Context, oldRoot string, in
 		switch fc.Type {
 		case source.ChangeDelete:
 			bm.recordRemoved(fc.Meta.Type)
-			s.root, err = bm.tree.Delete(s.root, fc.Meta.FileID)
+			s.root, err = bm.tree.Delete(s.root, primaryParentID(&fc.Meta), fc.Meta.FileID)
 			if err != nil {
 				return fmt.Errorf("hamt delete %s: %w", fc.Meta.FileID, err)
 			}
@@ -132,7 +145,7 @@ func (bm *BackupManager) scanIncremental(ctx context.Context, oldRoot string, in
 // fast-path compares observable metadata and carries the hash forward to avoid
 // false-positive diffs.
 func (bm *BackupManager) detectChange(oldRoot string, meta *core.FileMeta) (changed bool, oldRef string, err error) {
-	oldRef, err = bm.tree.Lookup(oldRoot, meta.FileID)
+	oldRef, err = bm.tree.Lookup(oldRoot, primaryParentID(meta), meta.FileID)
 	if err != nil {
 		return false, "", fmt.Errorf("hamt lookup: %w", err)
 	}
@@ -188,7 +201,7 @@ func (bm *BackupManager) insertFolder(_ context.Context, root string, meta *core
 		bm.pendingMetas[metaRef] = metaData
 	}
 	bm.trackFileMeta(metaRef, *meta)
-	return bm.tree.Insert(root, meta.FileID, metaRef)
+	return bm.tree.Insert(root, primaryParentID(meta), meta.FileID, metaRef)
 }
 
 func (bm *BackupManager) flushPendingMetas(ctx context.Context) error {
@@ -280,10 +293,18 @@ func (bm *BackupManager) buildPathFromTree(root string, meta *core.FileMeta) str
 
 // lookupMetaByFileID resolves a FileID to its FileMeta via the HAMT tree.
 // It checks newMetas (just inserted this scan) first, then falls back to the store.
+// Uses parentIndex to resolve the AffinityKey; falls back to a full-tree walk
+// for entries not yet seen in this scan (e.g. incremental backups).
 func (bm *BackupManager) lookupMetaByFileID(root, fileID string) *core.FileMeta {
-	ref, err := bm.tree.Lookup(root, fileID)
+	parentID := bm.parentIndex[fileID]
+	ref, err := bm.tree.Lookup(root, parentID, fileID)
 	if err != nil || ref == "" {
-		return nil
+		// parentID not in index (e.g. entry from a previous snapshot not re-scanned);
+		// fall back to a walk-based lookup.
+		ref, err = bm.tree.LookupByFileID(root, fileID)
+		if err != nil || ref == "" {
+			return nil
+		}
 	}
 	if fm, ok := bm.newMetas[ref]; ok {
 		return &fm

diff --git a/internal/engine/backup_test.go b/internal/engine/backup_test.go
@@ -60,9 +60,9 @@ func TestBackupManager_ResolvesPathsForOpaqueIDs(t *testing.T) {
 	readStore := store.NewCompressedStore(dest)
 	tree := hamt.NewTree(readStore)
 
-	checkPath := func(fileID, expectedPath string) {
+	checkPath := func(parentID, fileID, expectedPath string) {
 		t.Helper()
-		ref, err := tree.Lookup(result.Root, fileID)
+		ref, err := tree.Lookup(result.Root, parentID, fileID)
 		if err != nil || ref == "" {
 			t.Fatalf("Lookup %s: ref=%q err=%v", fileID, ref, err)
 		}
@@ -81,9 +81,9 @@ func TestBackupManager_ResolvesPathsForOpaqueIDs(t *testing.T) {
 		}
 	}
 
-	checkPath("FOLDER_A", "Documents")
-	checkPath("FOLDER_B", "Documents/Photos")
-	checkPath("FILE_C", "Documents/Photos/pic.jpg")
+	checkPath("", "FOLDER_A", "Documents")
+	checkPath("FOLDER_A", "FOLDER_B", "Documents/Photos")
+	checkPath("FOLDER_B", "FILE_C", "Documents/Photos/pic.jpg")
 }
 
 func TestBackupManager_Run(t *testing.T) {
@@ -105,7 +105,7 @@ func TestBackupManager_Run(t *testing.T) {
 	lookupMeta := func(root, key string) *core.FileMeta {
 		t.Helper()
 		tree := hamt.NewTree(readStore)
-		ref, err := tree.Lookup(root, key)
+		ref, err := tree.Lookup(root, "", key)
 		if err != nil {
 			t.Fatalf("Lookup %s: %v", key, err)
 		}

diff --git a/internal/engine/backup_upload.go b/internal/engine/backup_upload.go
@@ -31,6 +31,7 @@ var inlineBufferPool = sync.Pool{
 
 type uploadResult struct {
 	fileID        string
+	parentID      string   // primary parent's raw fileID (for AffinityKey)
 	ref           string
 	meta          core.FileMeta
 	contentRef    string   // content key to cache (empty when dedup'd)
@@ -91,7 +92,7 @@ func (bm *BackupManager) upload(ctx context.Context, pending []core.FileMeta, to
 			phase.Error()
 			return "", res.err
 		}
-		root, err = bm.tree.Insert(root, res.fileID, res.ref)
+		root, err = bm.tree.Insert(root, res.parentID, res.fileID, res.ref)
 		if err != nil {
 			phase.Error()
 			return "", fmt.Errorf("hamt insert: %w", err)
@@ -128,6 +129,7 @@ func (bm *BackupManager) processFile(ctx context.Context, meta core.FileMeta, ph
 	}
 	return uploadResult{
 		fileID:        meta.FileID,
+		parentID:      primaryParentID(&meta),
 		ref:           metaRef,
 		meta:          meta,
 		contentRef:    contentRef,

diff --git a/internal/engine/check_test.go b/internal/engine/check_test.go
@@ -42,7 +42,7 @@ func buildTestRepo(t *testing.T, mockStore *MockStore) (snapRef, rootRef, metaRe
 
 	// HAMT tree
 	directTree := hamt.NewTree(mockStore)
-	rootRef, err := directTree.Insert("", "file1", metaRef)
+	rootRef, err := directTree.Insert("", "", "file1", metaRef)
 	if err != nil {
 		t.Fatalf("Failed to build HAMT: %v", err)
 	}
@@ -304,7 +304,7 @@ func TestCheckManager_ContentRef_HMACPath(t *testing.T) {
 
 	// HAMT tree + snapshot
 	directTree := hamt.NewTree(mockStore)
-	rootRef, err := directTree.Insert("", "hmac-file", metaRef)
+	rootRef, err := directTree.Insert("", "", "hmac-file", metaRef)
 	if err != nil {
 		t.Fatalf("Failed to build HAMT: %v", err)
 	}
@@ -358,7 +358,7 @@ func TestCheckManager_CorruptChunk_HMACReadData(t *testing.T) {
 	_ = mockStore.Put(ctx, metaRef, metaData)
 
 	directTree := hamt.NewTree(mockStore)
-	rootRef, err := directTree.Insert("", "corrupt-hmac-file", metaRef)
+	rootRef, err := directTree.Insert("", "", "corrupt-hmac-file", metaRef)
 	if err != nil {
 		t.Fatalf("Failed to build HAMT: %v", err)
 	}

diff --git a/internal/engine/diff_test.go b/internal/engine/diff_test.go
@@ -78,7 +78,7 @@ func createHamt(t *testing.T, s *MockStore, ids []string, refs []string) string
 	root := ""
 	for i, id := range ids {
 		var err error
-		root, err = tree.Insert(root, id, refs[i])
+		root, err = tree.Insert(root, "", id, refs[i])
 		if err != nil {
 			t.Fatalf("Insert failed: %v", err)
 		}

diff --git a/internal/engine/prune_test.go b/internal/engine/prune_test.go
@@ -34,7 +34,7 @@ func TestPruneManager_Run(t *testing.T) {
 	// HAMT Construction using BackupManager's tree for flushing.
 	src := NewMockSource()
 	bkMgr := NewBackupManager(src, mockStore, ui.NewNoOpReporter(), nil, WithVerbose())
-	rootRef, err := bkMgr.tree.Insert("", "file1", metaRef)
+	rootRef, err := bkMgr.tree.Insert("", "", "file1", metaRef)
 	if err != nil {
 		t.Fatalf("Failed to create hamt: %v", err)
 	}
@@ -45,7 +45,7 @@ func TestPruneManager_Run(t *testing.T) {
 
 	// Also insert directly into mock store (non-transactional).
 	directTree := hamt.NewTree(mockStore)
-	rootRef, err = directTree.Insert("", "file1", metaRef)
+	rootRef, err = directTree.Insert("", "", "file1", metaRef)
 	if err != nil {
 		t.Fatalf("Failed to insert: %v", err)
 	}