123 changes: 123 additions & 0 deletions docs/affinity-benchmark.md
@@ -0,0 +1,123 @@
# Affinity Model Benchmark Results

Benchmark script: `scripts/benchmark/affinity.sh`
Go unit test: `internal/hamt/affinity_bench_test.go`

---

## E2E benchmark — `scripts/benchmark/affinity.sh`

Two scenarios with **50 modified files each**, applied to the same initial tree
(50 dirs × 50 files = 2 500 files), run as a second backup after the initial full backup.

| Scenario | Change pattern | New HAMT node objects (affinity) | New HAMT node objects (legacy) |
|---|---|---|---|
| A — clustered | All 50 changes in `dir_01` | **18** | 75 |
| B — scattered | 1 change in each of 50 dirs | 171 | 73 |

The metric is **new `node/*` objects** written to the local store during the second
backup. `KeyCacheStore` deduplicates writes: nodes whose content did not change are
skipped, so only genuinely new HAMT path nodes reach the underlying store.

> **Note:** The `Flushing HAMT: X reachable nodes` progress line shows **staging size**
> (the full final tree), not delta writes. Do not use it to judge incremental cost.
> The benchmark counts `node/*` entries in `index/packs` before/after the second backup.

### Cross-binary comparison

```
# Affinity binary (RFC 0002)
Scenario A — clustered (50 files in 1 dir): 18 new node objects
Scenario B — scattered (1 file in 50 dirs): 171 new node objects
Node-write reduction (A vs B): 89.5% (153 fewer writes)

# Legacy binary (pre-RFC 0002)
Scenario A — clustered (50 files in 1 dir): 75 new node objects
Scenario B — scattered (1 file in 50 dirs): 73 new node objects
Difference: ~-2.7% (no meaningful locality benefit)
```

Run the comparison yourself:

```bash
./scripts/benchmark/affinity.sh # current build
CLOUDSTIC_BIN=/path/to/old-cloudstic ./scripts/benchmark/affinity.sh
```

### Why clustered beats scattered (affinity model)

With `AffinityKey(parentID, fileID) = SHA256(parentID)[:4] + SHA256(fileID)[4:]`,
all 50 files in `dir_01` share the same routing at HAMT levels 0–2 (determined by
`SHA256("dir_01")[:4]`). They diverge only at level 3. An incremental update rewrites:

- 1 root + 3 internal path nodes (L0 → L1 → L2 → L3) shared across all 50 files
- ~14 L3 leaf nodes (one per occupied bucket at the divergence level)
- Total: **~18 new nodes**
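
The shared-prefix behaviour described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `affinityKey` and `hexSHA256` helpers are hypothetical, and the sketch assumes the `[:4]` and `[4:]` slices apply to the hex-encoded SHA-256 digest (the text does not say whether the unit is hex characters or bytes).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hexSHA256 returns the hex-encoded SHA-256 digest of s.
func hexSHA256(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// affinityKey is a hypothetical stand-in for AffinityKey(parentID, fileID).
// Assumption: the first 4 hex characters come from the parent's digest and
// the remaining 60 from the file's digest.
func affinityKey(parentID, fileID string) string {
	return hexSHA256(parentID)[:4] + hexSHA256(fileID)[4:]
}

func main() {
	a := affinityKey("dir_01", "file_a")
	b := affinityKey("dir_01", "file_b")
	c := affinityKey("dir_02", "file_a")

	// Siblings share the parent-derived prefix, so they route identically
	// through the upper HAMT levels.
	fmt.Println("same dir, same prefix:", a[:4] == b[:4])
	// The same file ID under another parent keeps its file-derived suffix.
	fmt.Println("same file, same suffix:", a[4:] == c[4:])
	fmt.Println("key length:", len(a))
}
```

Under these assumptions both comparisons print `true` and the key length is 64, the same as a plain hex SHA-256 digest.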

For scattered changes (1 file per directory), each file traverses a different path
from root — 50 distinct root-to-leaf paths are dirtied. Because affinity keys cluster
same-dir files into deeper sub-trees, those cross-dir paths are also longer, so
scattered writes are higher with affinity (171) than with legacy keys (73). This is an
expected trade-off: affinity optimises the common case (changes concentrated in a
directory) at the cost of a markedly worse fully scattered case (171 vs 73 node writes, about 2.3×).

### Why legacy shows no difference between A and B

`SHA256(fileID)` distributes all keys uniformly across the HAMT regardless of which
directory a file lives in. Clustered changes and scattered changes both dirty ~30 of 32
L0 buckets. There is no shared path to exploit, so both scenarios produce roughly the
same number of new node writes (~70–75).

---

## Go unit test — `TestAffinityNodeWriteReduction`

Simulates an incremental backup: build a 1 000-file tree (10 dirs × 100 files),
then update all 100 files in one directory. Only the changed files touch new HAMT paths.

```bash
go test ./internal/hamt/ -run TestAffinityNodeWriteReduction -v
```

Result:

```
Incremental update of 100 files in one directory (1000 total files, 10 dirs):
affinity keys : 20 node writes
legacy keys : 68 node writes
reduction : 70.6% (48 fewer writes)
```

The legacy simulation uses `AffinityKey(fileID, fileID) = SHA256(fileID)`, identical
to the old `computePathKey(fileID)` — no code changes needed.
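
The identity used by the legacy simulation follows from the key layout: when both arguments are the same ID, the prefix and the suffix are cut from the same digest and reassemble it. A small check, assuming hex-digest slicing and using a hypothetical `affinityKey` helper rather than the project's function:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hexSHA256 returns the hex-encoded SHA-256 digest of s.
func hexSHA256(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// affinityKey is a hypothetical stand-in for AffinityKey(parentID, fileID),
// assuming the slices operate on hex-encoded digests.
func affinityKey(parentID, fileID string) string {
	return hexSHA256(parentID)[:4] + hexSHA256(fileID)[4:]
}

func main() {
	id := "file_42"
	// Passing the file ID as its own parent reproduces the plain digest.
	fmt.Println(affinityKey(id, id) == hexSHA256(id)) // prints true
}
```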

### Go benchmark — `BenchmarkIncrementalUpdate_*`

```bash
go test ./internal/hamt/ -run=^$ -bench=BenchmarkIncrementalUpdate -benchmem -benchtime=3s
```

| Strategy | ns/op | B/op | allocs/op |
|---|---|---|---|
| Affinity | 2 171 472 | 1 254 035 | 14 363 |
| Legacy | 3 812 540 | 1 968 168 | 15 992 |

**~1.75× faster, ~36% less memory** for a 100-file incremental update in one directory.

---

## Summary

| Metric | Legacy | Affinity | Delta |
|---|---|---|---|
| E2E: clustered 50-file update | 75 nodes | 18 nodes | **−76%** |
| E2E: scattered 50-file update | 73 nodes | 171 nodes | +134% (expected trade-off) |
| E2E: initial tree size (50×50) | 962 nodes | 906 nodes | −6% |
| Unit test: 100-file update, 1 dir | 68 nodes | 20 nodes | **−71%** |
| Go benchmark: time per op | 3 813 µs | 2 171 µs | **−43%** |
| Go benchmark: memory per op | 1 968 KB | 1 254 KB | **−36%** |

The affinity model's benefit is specifically for **incremental updates of multiple files
in the same directory**, the dominant pattern in real workloads. The scattered case (one
change in every directory at once) is a pathological pattern that affinity does not
optimise for.
6 changes: 4 additions & 2 deletions internal/core/models.go
@@ -62,8 +62,9 @@ type HAMTNode struct {

// LeafEntry represents an entry in a Leaf node
type LeafEntry struct {
- Key string `json:"key"` // FileID
- FileMeta string `json:"filemeta"` // "filemeta/<sha256>"
+ Key string `json:"key"` // FileID
+ PathKey string `json:"path_key,omitempty"` // AffinityKey routing key; falls back to SHA256(Key) if empty
+ FileMeta string `json:"filemeta"` // "filemeta/<sha256>"
}

// SourceInfo describes the origin of a backup snapshot. It is stored as a
@@ -87,6 +88,7 @@ type Snapshot struct {
Tags []string `json:"tags,omitempty"`
ChangeToken string `json:"change_token,omitempty"`
ExcludeHash string `json:"exclude_hash,omitempty"`
+ HAMTVersion int `json:"hamt_version,omitempty"` // 1 = legacy, 2 = affinity keys
}

// Index represents a pointer to the latest snapshot
3 changes: 3 additions & 0 deletions internal/engine/backup.go
@@ -94,6 +94,7 @@ type BackupManager struct {
metaCacheMu sync.RWMutex
metaCache map[string]core.FileMeta
pendingMetas map[string][]byte // deferred filemeta PUTs (ref → JSON)
+ parentIndex map[string]string // fileID → primary parent fileID (for AffinityKey lookups)
hmacKey []byte
}

@@ -122,6 +123,7 @@ func NewBackupManager(src source.Source, dest store.ObjectStore, reporter ui.Rep
newMetas: make(map[string]core.FileMeta),
metaCache: make(map[string]core.FileMeta),
pendingMetas: make(map[string][]byte),
+ parentIndex: make(map[string]string),
hmacKey: hmacKey,
}
}
@@ -321,6 +323,7 @@ func (bm *BackupManager) saveSnapshot(ctx context.Context, root string, seq int,
Meta: meta,
ChangeToken: changeToken,
ExcludeHash: bm.cfg.excludeHash,
+ HAMTVersion: 2,
}

hash, snapData, err := core.ComputeJSONHash(&snap)
33 changes: 27 additions & 6 deletions internal/engine/backup_scan.go
@@ -38,12 +38,25 @@ type scanState struct {
totalBytes int64
}

+ // primaryParentID returns the raw source-level parent identifier for a FileMeta.
+ // This is the first element of meta.Parents, which contains raw source IDs (e.g. GDrive folder IDs).
+ // Returns "" for root-level entries with no parents.
+ func primaryParentID(meta *core.FileMeta) string {
+ if len(meta.Parents) > 0 {
+ return meta.Parents[0]
+ }
+ return ""
+ }

func (bm *BackupManager) processEntry(ctx context.Context, meta *core.FileMeta, oldRoot string, s *scanState, phase ui.Phase) error {
if meta.Type == core.FileTypeFolder {
meta.ContentHash = ""
meta.Size = 0
}

+ // Record this entry's parent so lookupMetaByFileID can use AffinityKey.
+ bm.parentIndex[meta.FileID] = primaryParentID(meta)

// Resolve Paths when the source hasn't populated it (incremental/changes
// sources only emit changed entries and can't build a full path map).
if len(meta.Paths) == 0 {
@@ -57,7 +70,7 @@ func (bm *BackupManager) processEntry(ctx context.Context, meta *core.FileMeta,

if !changed {
bm.recordStat(meta.Type, false, false)
- s.root, err = bm.tree.Insert(s.root, meta.FileID, oldRef)
+ s.root, err = bm.tree.Insert(s.root, primaryParentID(meta), meta.FileID, oldRef)
if err != nil {
return fmt.Errorf("hamt insert: %w", err)
}
@@ -106,7 +119,7 @@ func (bm *BackupManager) scanIncremental(ctx context.Context, oldRoot string, in
switch fc.Type {
case source.ChangeDelete:
bm.recordRemoved(fc.Meta.Type)
- s.root, err = bm.tree.Delete(s.root, fc.Meta.FileID)
+ s.root, err = bm.tree.Delete(s.root, primaryParentID(&fc.Meta), fc.Meta.FileID)
if err != nil {
return fmt.Errorf("hamt delete %s: %w", fc.Meta.FileID, err)
}
@@ -132,7 +145,7 @@ func (bm *BackupManager) scanIncremental(ctx context.Context, oldRoot string, in
// fast-path compares observable metadata and carries the hash forward to avoid
// false-positive diffs.
func (bm *BackupManager) detectChange(oldRoot string, meta *core.FileMeta) (changed bool, oldRef string, err error) {
- oldRef, err = bm.tree.Lookup(oldRoot, meta.FileID)
+ oldRef, err = bm.tree.Lookup(oldRoot, primaryParentID(meta), meta.FileID)
if err != nil {
return false, "", fmt.Errorf("hamt lookup: %w", err)
}
@@ -188,7 +201,7 @@ func (bm *BackupManager) insertFolder(_ context.Context, root string, meta *core
bm.pendingMetas[metaRef] = metaData
}
bm.trackFileMeta(metaRef, *meta)
- return bm.tree.Insert(root, meta.FileID, metaRef)
+ return bm.tree.Insert(root, primaryParentID(meta), meta.FileID, metaRef)
}

func (bm *BackupManager) flushPendingMetas(ctx context.Context) error {
@@ -280,10 +293,18 @@ func (bm *BackupManager) buildPathFromTree(root string, meta *core.FileMeta) str

// lookupMetaByFileID resolves a FileID to its FileMeta via the HAMT tree.
// It checks newMetas (just inserted this scan) first, then falls back to the store.
+ // Uses parentIndex to resolve the AffinityKey; falls back to a full-tree walk
+ // for entries not yet seen in this scan (e.g. incremental backups).
func (bm *BackupManager) lookupMetaByFileID(root, fileID string) *core.FileMeta {
- ref, err := bm.tree.Lookup(root, fileID)
+ parentID := bm.parentIndex[fileID]
+ ref, err := bm.tree.Lookup(root, parentID, fileID)
if err != nil || ref == "" {
- return nil
+ // parentID not in index (e.g. entry from a previous snapshot not re-scanned);
+ // fall back to a walk-based lookup.
+ ref, err = bm.tree.LookupByFileID(root, fileID)
+ if err != nil || ref == "" {
+ return nil
+ }
}
if fm, ok := bm.newMetas[ref]; ok {
return &fm
12 changes: 6 additions & 6 deletions internal/engine/backup_test.go
@@ -60,9 +60,9 @@ func TestBackupManager_ResolvesPathsForOpaqueIDs(t *testing.T) {
readStore := store.NewCompressedStore(dest)
tree := hamt.NewTree(readStore)

- checkPath := func(fileID, expectedPath string) {
+ checkPath := func(parentID, fileID, expectedPath string) {
t.Helper()
- ref, err := tree.Lookup(result.Root, fileID)
+ ref, err := tree.Lookup(result.Root, parentID, fileID)
if err != nil || ref == "" {
t.Fatalf("Lookup %s: ref=%q err=%v", fileID, ref, err)
}
@@ -81,9 +81,9 @@ func TestBackupManager_ResolvesPathsForOpaqueIDs(t *testing.T) {
}
}

- checkPath("FOLDER_A", "Documents")
- checkPath("FOLDER_B", "Documents/Photos")
- checkPath("FILE_C", "Documents/Photos/pic.jpg")
+ checkPath("", "FOLDER_A", "Documents")
+ checkPath("FOLDER_A", "FOLDER_B", "Documents/Photos")
+ checkPath("FOLDER_B", "FILE_C", "Documents/Photos/pic.jpg")
}

func TestBackupManager_Run(t *testing.T) {
@@ -105,7 +105,7 @@ func TestBackupManager_Run(t *testing.T) {
lookupMeta := func(root, key string) *core.FileMeta {
t.Helper()
tree := hamt.NewTree(readStore)
- ref, err := tree.Lookup(root, key)
+ ref, err := tree.Lookup(root, "", key)
if err != nil {
t.Fatalf("Lookup %s: %v", key, err)
}
4 changes: 3 additions & 1 deletion internal/engine/backup_upload.go
@@ -31,6 +31,7 @@ var inlineBufferPool = sync.Pool{

type uploadResult struct {
fileID string
+ parentID string // primary parent's raw fileID (for AffinityKey)
ref string
meta core.FileMeta
contentRef string // content key to cache (empty when dedup'd)
@@ -91,7 +92,7 @@ func (bm *BackupManager) upload(ctx context.Context, pending []core.FileMeta, to
phase.Error()
return "", res.err
}
- root, err = bm.tree.Insert(root, res.fileID, res.ref)
+ root, err = bm.tree.Insert(root, res.parentID, res.fileID, res.ref)
if err != nil {
phase.Error()
return "", fmt.Errorf("hamt insert: %w", err)
@@ -128,6 +129,7 @@ func (bm *BackupManager) processFile(ctx context.Context, meta core.FileMeta, ph
}
return uploadResult{
fileID: meta.FileID,
+ parentID: primaryParentID(&meta),
ref: metaRef,
meta: meta,
contentRef: contentRef,
6 changes: 3 additions & 3 deletions internal/engine/check_test.go
@@ -42,7 +42,7 @@ func buildTestRepo(t *testing.T, mockStore *MockStore) (snapRef, rootRef, metaRe

// HAMT tree
directTree := hamt.NewTree(mockStore)
- rootRef, err := directTree.Insert("", "file1", metaRef)
+ rootRef, err := directTree.Insert("", "", "file1", metaRef)
if err != nil {
t.Fatalf("Failed to build HAMT: %v", err)
}
@@ -304,7 +304,7 @@ func TestCheckManager_ContentRef_HMACPath(t *testing.T) {

// HAMT tree + snapshot
directTree := hamt.NewTree(mockStore)
- rootRef, err := directTree.Insert("", "hmac-file", metaRef)
+ rootRef, err := directTree.Insert("", "", "hmac-file", metaRef)
if err != nil {
t.Fatalf("Failed to build HAMT: %v", err)
}
@@ -358,7 +358,7 @@ func TestCheckManager_CorruptChunk_HMACReadData(t *testing.T) {
_ = mockStore.Put(ctx, metaRef, metaData)

directTree := hamt.NewTree(mockStore)
- rootRef, err := directTree.Insert("", "corrupt-hmac-file", metaRef)
+ rootRef, err := directTree.Insert("", "", "corrupt-hmac-file", metaRef)
if err != nil {
t.Fatalf("Failed to build HAMT: %v", err)
}
2 changes: 1 addition & 1 deletion internal/engine/diff_test.go
@@ -78,7 +78,7 @@ func createHamt(t *testing.T, s *MockStore, ids []string, refs []string) string
root := ""
for i, id := range ids {
var err error
- root, err = tree.Insert(root, id, refs[i])
+ root, err = tree.Insert(root, "", id, refs[i])
if err != nil {
t.Fatalf("Insert failed: %v", err)
}
4 changes: 2 additions & 2 deletions internal/engine/prune_test.go
@@ -34,7 +34,7 @@ func TestPruneManager_Run(t *testing.T) {
// HAMT Construction using BackupManager's tree for flushing.
src := NewMockSource()
bkMgr := NewBackupManager(src, mockStore, ui.NewNoOpReporter(), nil, WithVerbose())
- rootRef, err := bkMgr.tree.Insert("", "file1", metaRef)
+ rootRef, err := bkMgr.tree.Insert("", "", "file1", metaRef)
if err != nil {
t.Fatalf("Failed to create hamt: %v", err)
}
@@ -45,7 +45,7 @@ func TestPruneManager_Run(t *testing.T) {

// Also insert directly into mock store (non-transactional).
directTree := hamt.NewTree(mockStore)
- rootRef, err = directTree.Insert("", "file1", metaRef)
+ rootRef, err = directTree.Insert("", "", "file1", metaRef)
if err != nil {
t.Fatalf("Failed to insert: %v", err)
}