feat: add user_id-based analytics with LEFT JOIN users#551

Merged
ColeMurray merged 3 commits into main from phase-3/analytics-user-identity
Apr 24, 2026

@ColeMurray ColeMurray commented Apr 24, 2026

Summary

Phase 3 of the unified user model migration (ColeMurray/background-agents#523). Rewrites analytics queries to use the canonical user_id for user attribution via LEFT JOIN users, replacing the previous scm_login-only approach.

  • getSummary: active_users uses COALESCE(user_id, scm_login) — prefers canonical identity, falls back to SCM login for unlinked sessions
  • getTimeseries: LEFT JOIN users for display name labels; groups by user_id when available (merging sessions across different scm_login values for the same user)
  • getBreakdown (by user): LEFT JOIN users, returns displayName from users table; key is user_id (stable canonical ID) with scm_login fallback
  • getBreakdown (by repo): Unchanged — no JOIN needed
  • Shared type: Adds optional displayName to AnalyticsBreakdownEntry
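
The attribution rule for active users can be sketched in plain TypeScript. This is a minimal in-memory illustration, not the PR's code: the `SessionRow` shape and the `countActiveUsers` helper are hypothetical names, and the real query runs in SQL against the sessions table.

```typescript
// Hypothetical row shape; the real sessions table has more columns.
interface SessionRow {
  user_id: string | null;
  scm_login: string | null;
}

// In-memory equivalent of COUNT(DISTINCT COALESCE(user_id, scm_login)).
// SQL COUNT(DISTINCT ...) ignores NULLs, so rows with neither identity are skipped.
function countActiveUsers(rows: SessionRow[]): number {
  const identities = new Set<string>();
  for (const row of rows) {
    const identity = row.user_id ?? row.scm_login; // COALESCE: first non-null value
    if (identity !== null) identities.add(identity);
  }
  return identities.size;
}

const sampleSessions: SessionRow[] = [
  { user_id: "u1", scm_login: "alice" },      // linked session
  { user_id: "u1", scm_login: "alice-work" }, // same user, different login
  { user_id: null, scm_login: "bob" },        // unlinked: falls back to scm_login
  { user_id: null, scm_login: null },         // no identity: not counted
];

console.log(countActiveUsers(sampleSessions)); // 2
```

The two sessions for `u1` collapse into one active user even though their logins differ, which is the merging behavior the summary describes.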

Backward compatibility

All queries are backward-compatible for sessions without user_id (all current sessions, until Phase 4 wires identity forwarding from each bot):

  • Summary activeUsers: COALESCE(user_id, scm_login) → falls back to scm_login → identical count
  • Timeseries group labels: COALESCE(display_name, scm_login, 'unknown') → same labels as today
  • Breakdown keys: COALESCE(user_id, scm_login, '__unknown__') → same keys as today for linked sessions; the sentinel changes from 'unknown' to '__unknown__' (frontend will be updated in Phase 6, PR 13)
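
The fallback chains listed above can be sketched as small TypeScript helpers. This is an illustration under assumptions: `JoinedRow`, `coalesce`, `groupLabel`, and `breakdownKey` are hypothetical names, and the sentinel values are the ones stated in this description (a later commit in this thread restores 'unknown' for the breakdown key).

```typescript
// Hypothetical joined row: a session LEFT JOINed to users.
// display_name is null when no users row matches.
interface JoinedRow {
  user_id: string | null;
  scm_login: string | null;
  display_name: string | null;
}

// Generic COALESCE: returns the first argument that is not null.
function coalesce(...values: (string | null)[]): string | null {
  for (const v of values) {
    if (v !== null) return v;
  }
  return null;
}

// COALESCE(display_name, scm_login, 'unknown')
function groupLabel(row: JoinedRow): string {
  return coalesce(row.display_name, row.scm_login, "unknown") as string;
}

// COALESCE(user_id, scm_login, '__unknown__')
function breakdownKey(row: JoinedRow): string {
  return coalesce(row.user_id, row.scm_login, "__unknown__") as string;
}

const unlinked: JoinedRow = { user_id: null, scm_login: "alice", display_name: null };
const linked: JoinedRow = { user_id: "u1", scm_login: "alice", display_name: "Alice A." };

console.log(groupLabel(unlinked), breakdownKey(unlinked)); // alice alice
console.log(groupLabel(linked), breakdownKey(linked));     // Alice A. u1
```

For a session without a user_id, both the label and the key resolve to the SCM login, which is why results are unchanged until identity forwarding is wired up.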

Depends on

Test plan

  • Integration tests: all 6 analytics tests pass (324 total integration tests)
  • New test: seeds users table + sessions with user_id, verifies merged breakdown entry with displayName from users table, correct activeUsers count, and timeseries group labels
  • Existing tests updated for __unknown__ sentinel and displayName in breakdown responses
  • Unit tests: 994 pass
  • Typecheck: clean across control-plane and web packages

Summary by CodeRabbit

  • New Features

    • Breakdown entries can include optional human-readable user display names.
  • Bug Fixes

    • Active user counts and timeseries now consolidate by canonical user identity rather than raw SCM logins, producing more accurate counts and aggregated series.
    • Timeseries aggregation now sums multiple rows for the same date/label instead of overwriting.
  • Tests

    • Added integration tests for identity consolidation, display-name labeling, and correct timeseries aggregation.

Rewrite analytics queries to use the canonical user_id for user
attribution, with backward-compatible fallback to scm_login for
sessions that don't have a user_id yet.

- getSummary: active_users uses COALESCE(user_id, scm_login)
- getTimeseries: LEFT JOIN users for display_name, GROUP BY user_id
- getBreakdown (user): LEFT JOIN users, displayName in response
- Add displayName field to shared AnalyticsBreakdownEntry type

coderabbitai Bot commented Apr 24, 2026

Walkthrough

Analytics queries and tests were updated to prefer canonical user identity: sessions are joined to users, grouping/deduplication now uses user_id (falling back to non-empty scm_login), timeseries aggregation sums per (date, label), and breakdown entries may include an optional displayName.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Analytics Query Logic: packages/control-plane/src/db/analytics-store.ts | Rewrote getSummary, getTimeseries, and getBreakdown to alias sessions as s, LEFT JOIN users u, use COALESCE(u.display_name, s.scm_login, 'unknown') for labels, group/deduplicate by COALESCE(s.user_id, NULLIF(s.scm_login,''), 'unknown'), count distinct user_id for active users, and change timeseries to SUM counts for identical (date, group) keys rather than overwriting. |
| Analytics Tests: packages/control-plane/test/integration/analytics.test.ts | Added seedUser helper and extended seedSession to optionally write userId; updated expectations to include displayName (e.g., "Unknown user"); added tests validating consolidation by shared userId, distinct userId active-user counting, and summation behavior when multiple userIds share a display name. |
| Shared Types: packages/shared/src/types/index.ts | Extended AnalyticsBreakdownEntry to add optional displayName?: string so breakdown responses can include human-readable labels returned from the store. |
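
A minimal sketch of how a consumer might use the optional field. Only `key` and `displayName` are named in this thread; the `sessions` metric field and the `entryLabel` helper are placeholder assumptions for illustration.

```typescript
// Sketch of the shared type. `sessions` is a hypothetical metric field.
interface AnalyticsBreakdownEntry {
  key: string;          // user_id, scm_login fallback, or repo name
  displayName?: string; // present only when the store returns a human-readable label
  sessions: number;
}

// Prefer the human-readable label and fall back to the key when it is absent,
// so repo breakdown entries (which carry no displayName) still render.
function entryLabel(entry: AnalyticsBreakdownEntry): string {
  return entry.displayName ?? entry.key;
}

console.log(entryLabel({ key: "u1", displayName: "Alice A.", sessions: 3 })); // Alice A.
console.log(entryLabel({ key: "org/repo", sessions: 1 }));                    // org/repo
```

Keeping the field optional means existing consumers that read only `key` continue to typecheck.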

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🐰 I hopped through rows and found a clue,

Names now gather where IDs grew true;
Sessions stitched by one bright thread,
Timeseries sums and labels spread.
Rejoice — analytics with a clearer view!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: rewriting analytics queries to use user_id-based identity with LEFT JOIN users, which aligns with the PR's primary objective of implementing Phase 3 of the user model migration. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions

Terraform Validation Results

Step Status
Format
Init
Validate

Note: Terraform plan was skipped because secrets are not configured. This is expected for external contributors. See docs/GETTING_STARTED.md for setup instructions.

Pushed by: @ColeMurray, Action: pull_request

open-inspect Bot previously requested changes Apr 24, 2026

@open-inspect open-inspect Bot left a comment

Summary

PR Title: feat: add user_id-based analytics with LEFT JOIN users (#551)
Author: @ColeMurray
Files changed: 3 files, +178 / -29

This updates analytics queries to attribute sessions by canonical user_id and pull display names from the users table. The new integration coverage is helpful, and the analytics integration suite passes locally, but I found two blocking issues before this is safe to merge.

Critical Issues

  • [Functionality] packages/control-plane/src/db/analytics-store.ts:107 - getTimeseries() labels rows by display name but groups by canonical identity. If two different users share the same display name on the same day, the SQL returns multiple rows with the same rendered label and the reducer overwrites the earlier value instead of summing it. This will undercount chart data for common names. Suggested fix: either group by the rendered label in SQL as well, or merge duplicate labels with additive updates in TypeScript.
  • [Compatibility] packages/control-plane/src/db/analytics-store.ts:143 - getBreakdown(by=user) now returns opaque user_id values and the __unknown__ sentinel in key. The current web consumer still renders and sorts by entry.key and still checks for the old unknown sentinel, so this backend-only change regresses the current UI by exposing raw IDs and dropping the unknown-user treatment. Suggested fix: preserve the old key semantics until the frontend lands, or ship the paired frontend change in the same rollout.
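
The label collision described in the first issue can be reproduced with a small reducer sketch. The row and series shapes here are hypothetical, chosen only to show the difference between assignment and additive merging; they are not the store's actual types.

```typescript
// Hypothetical shape of one SQL result row: grouped by canonical identity,
// but labeled by display name, so two rows can share (date, label).
interface TimeseriesRow {
  date: string;
  label: string;
  count: number;
}

type Series = Record<string, Record<string, number>>; // date -> label -> count

// Buggy variant: a second row with the same (date, label) replaces the first.
function reduceOverwrite(rows: TimeseriesRow[]): Series {
  const out: Series = {};
  for (const { date, label, count } of rows) {
    const day = out[date] ?? (out[date] = {});
    day[label] = count;
  }
  return out;
}

// Fixed variant: duplicate (date, label) pairs are added together.
function reduceSum(rows: TimeseriesRow[]): Series {
  const out: Series = {};
  for (const { date, label, count } of rows) {
    const day = out[date] ?? (out[date] = {});
    day[label] = (day[label] ?? 0) + count;
  }
  return out;
}

// Two distinct users who share a display name on the same day.
const collidingRows: TimeseriesRow[] = [
  { date: "2026-04-20", label: "Alex Kim", count: 3 },
  { date: "2026-04-20", label: "Alex Kim", count: 2 },
];

console.log(reduceOverwrite(collidingRows)["2026-04-20"]["Alex Kim"]); // 2 (undercount)
console.log(reduceSum(collidingRows)["2026-04-20"]["Alex Kim"]);       // 5
```

The additive version corresponds to the second suggested fix (merging duplicate labels in TypeScript); grouping by the rendered label in SQL would be the alternative.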

Suggestions

  • [Testing] packages/control-plane/test/integration/analytics.test.ts:630 - Add a coverage case with two distinct users who share the same display name on the same date. That would lock in the expected timeseries behavior and catch the collision bug above.

Nitpicks

  • None.

Positive Feedback

  • The summary query keeps the null/empty-login fallback compact with COALESCE(user_id, NULLIF(scm_login, '')), which is a clean way to preserve attribution for unlinked sessions.
  • The new integration test does a good job covering the intended linked-user behavior across breakdown, summary, and timeseries in one place.
  • The repo breakdown path stays isolated from the user-identity join, which keeps the change surface relatively small.

Questions

  • None.

Verdict

Request Changes

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/control-plane/src/db/analytics-store.ts (1)

61-74: ⚠️ Potential issue | 🟡 Minor

Add userId and scmLogin to automation session creation in packages/control-plane/src/scheduler/durable-object.ts line 572.

The sessionStore.create() call is missing both fields, even though automation.created_by is available. Without these, automation-spawned sessions will not be linked to a user in analytics queries, causing the COALESCE(user_id, NULLIF(scm_login, '')) logic in getAnalytics and downstream getBreakdown/getTimeseries to fail to match other sessions from the same user, splitting metrics across separate rows/series and inflating active_user counts.

Router paths (lines 941, 1942) and child session creation already populate both fields correctly; automation path should do the same:

scmLogin: automation.created_by || null,
userId: automation.created_by,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/control-plane/src/db/analytics-store.ts` around lines 61 - 74, The
automation spawn path is calling sessionStore.create() without setting userId
and scmLogin, so automation-created sessions aren't linked to users and the
COALESCE(user_id, NULLIF(scm_login, '')) logic in
getAnalytics/getBreakdown/getTimeseries miscounts active users; update the
automation branch in durable-object.ts (where sessionStore.create(...) is
invoked for automation at the automation spawn path) to pass scmLogin:
automation.created_by || null and userId: automation.created_by so automation
sessions match router/child session records and analytics queries aggregate
correctly.

📥 Commits

Reviewing files that changed from the base of the PR and between 8eec7f0 and 117078f.

📒 Files selected for processing (3)
  • packages/control-plane/src/db/analytics-store.ts
  • packages/control-plane/test/integration/analytics.test.ts
  • packages/shared/src/types/index.ts

- Fix timeseries reducer to sum duplicate display-name labels instead
  of overwriting (handles two users with the same display name)
- Preserve 'unknown' sentinel in breakdown key for backward compat
  with current web frontend (was changed to '__unknown__')
- Add integration test for display-name collision in timeseries

@ColeMurray

Re: coderabbit's suggestion to add userId/scmLogin to the automation session creation path —

This is intentionally out of scope for this PR. Automation session creation is Phase 5 (PR 12) in the phased implementation plan. The COALESCE fallback handles automation sessions gracefully — they continue to work identically to pre-migration behavior until Phase 5 wires the userId through.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/control-plane/src/db/analytics-store.ts`:
- Around line 146-150: The selected display_name expression is non-deterministic
because s.scm_login is not aggregated while GROUP BY only covers the key; update
the display_name selection to use aggregates (e.g., MAX(u.display_name) and
MAX(NULLIF(s.scm_login, '')) inside COALESCE) or else include the fallback
expression in the GROUP BY so the label is deterministic; change the
displayNameSelect (and analogous selections at the other occurrences) to use
MAX(u.display_name) and MAX(NULLIF(s.scm_login, '')) in the COALESCE fallback or
add those fallback expressions to GROUP BY to ensure stable results when
u.display_name IS NULL.
- Line 107: The SELECT uses COALESCE(u.display_name, NULLIF(s.scm_login, ''),
'unknown') which treats an empty u.display_name as a valid value and skips the
scm_login fallback; change the expression to COALESCE(NULLIF(u.display_name,
''), NULLIF(s.scm_login, ''), 'unknown') in the SQL (the COALESCE containing
u.display_name and s.scm_login) so empty strings normalize to NULL and the
fallback works — update both occurrences of that COALESCE in analytics-store.ts.
- Line 63: The COUNT(DISTINCT COALESCE(user_id, NULLIF(scm_login, ''))) used for
active_users will double-count the same person across phased rollouts because
migrations adding scm_login (0017) and user_id (0019) are no-backfill; update
documentation and/or add an in-repo note near the analytics logic (referencing
active_users, COALESCE, user_id, scm_login) explaining the expected transient
metric drift during Phase 4 and when Phase 5 backfill/consolidation will resolve
it, or alternatively add a comment indicating planned backfill timing and
rationale so future readers understand the temporary inflation.
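
The first two findings (empty strings defeating the fallback, and bare columns making the label non-deterministic) can be illustrated with an in-memory sketch. The helper names are hypothetical, and combining NULLIF with MAX in a single expression is an assumption about how the two suggestions fit together, not a quote from the merged code.

```typescript
// NULLIF(value, ''): turn empty strings into null so COALESCE skips them.
function nullIfEmpty(value: string | null): string | null {
  return value === "" ? null : value;
}

// MAX over a text column: ignores NULLs and returns null when every value is NULL.
// Plain `>` compares UTF-16 code units, which matches byte order for ASCII text.
function maxText(values: (string | null)[]): string | null {
  let best: string | null = null;
  for (const v of values) {
    if (v !== null && (best === null || v > best)) best = v;
  }
  return best;
}

// Hypothetical rows belonging to one GROUP BY bucket.
interface GroupedRow {
  display_name: string | null;
  scm_login: string | null;
}

// Assumed combined form:
// COALESCE(MAX(NULLIF(u.display_name, '')), MAX(NULLIF(s.scm_login, '')), 'unknown')
function groupDisplayName(group: GroupedRow[]): string {
  return (
    maxText(group.map((r) => nullIfEmpty(r.display_name))) ??
    maxText(group.map((r) => nullIfEmpty(r.scm_login))) ??
    "unknown"
  );
}

// Empty display names no longer block the scm_login fallback, and the result
// does not depend on which row the database happens to pick.
console.log(groupDisplayName([
  { display_name: "", scm_login: "alice" },
  { display_name: "", scm_login: "alice-work" },
])); // alice-work
```

Using an aggregate makes the chosen label a function of the whole group rather than of row order, at the cost of always preferring the lexicographically largest value.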

📥 Commits

Reviewing files that changed from the base of the PR and between 117078f and 7f0eed9.

📒 Files selected for processing (2)
  • packages/control-plane/src/db/analytics-store.ts
  • packages/control-plane/test/integration/analytics.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/control-plane/test/integration/analytics.test.ts

- Use NULLIF(u.display_name, '') to handle empty display names
- Use MAX() aggregates for non-deterministic bare columns in GROUP BY
- Add comment documenting expected active_users metric drift during
  phased rollout (Phase 4→6 window)

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
packages/control-plane/src/db/analytics-store.ts (1)

156-156: Optional: case-insensitive ordering for display-name sort.

SQLite's default collation is BINARY, so ORDER BY display_name ASC places all uppercase names before lowercase ones (e.g., Zed before alice). For a user-facing breakdown list this is usually jarring. Consider COLLATE NOCASE on the display-name tiebreaker; key ASC is already an opaque id so it's fine to leave alone.

♻️ Proposed tweak
-    const orderTail = isUserBreakdown ? "display_name ASC" : "key ASC";
+    const orderTail = isUserBreakdown ? "display_name COLLATE NOCASE ASC" : "key ASC";
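
The ordering difference behind this nitpick can be shown with two comparators. This is a sketch that mimics the two collations in TypeScript; the function names are hypothetical, and it folds only ASCII letters because SQLite's built-in NOCASE does the same.

```typescript
// BINARY-like ordering: compare code units, so "Z" (0x5A) sorts before "a" (0x61).
function compareBinary(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Fold only the 26 ASCII uppercase letters to lowercase.
function foldAscii(s: string): string {
  return s.replace(/[A-Z]/g, (c) => c.toLowerCase());
}

// NOCASE-like ordering: compare the ASCII-folded forms.
function compareNoCase(a: string, b: string): number {
  return compareBinary(foldAscii(a), foldAscii(b));
}

const displayNames = ["alice", "Zed", "Bob"];

console.log([...displayNames].sort(compareBinary).join(", ")); // Bob, Zed, alice
console.log([...displayNames].sort(compareNoCase).join(", ")); // alice, Bob, Zed
```

With the binary comparison every capitalized name precedes every lowercase one, which is the behavior the proposed `COLLATE NOCASE` tweak is meant to avoid in the user breakdown list.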
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/control-plane/src/db/analytics-store.ts` at line 156, The ORDER BY
tail uses "display_name ASC" when isUserBreakdown is true which sorts
case-sensitively; change the branch that sets orderTail (the variable in
analytics-store.ts used when isUserBreakdown is true) to use a case-insensitive
collation such as "display_name COLLATE NOCASE ASC" instead of "display_name
ASC" so user-facing display_name ordering is case-insensitive while leaving the
"key ASC" branch unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0eed9 and 75206d2.

📒 Files selected for processing (1)
  • packages/control-plane/src/db/analytics-store.ts

@ColeMurray ColeMurray merged commit 5a368cc into main Apr 24, 2026
18 checks passed
@ColeMurray ColeMurray deleted the phase-3/analytics-user-identity branch April 24, 2026 05:51