
GitStats Analytics Pipeline#32

Closed
adamtilton wants to merge 9 commits into develop from test/feature

Conversation

@adamtilton
Contributor

No description provided.

- Add 7 API endpoints for GitStats analytics integration:
  - Org-level: /analytics/summary, /analytics/codebases
  - Codebase-level: /codebases/{id}/analytics/overview|branches|activity|ownership|status
- Add AnalyticsService for reading pre-computed JSON from S3
  - Super Admins see all codebases
  - Source Admins see only administered codebases (filtered)
- Add response schemas matching GitStats export schemas
- Add unit tests for AnalyticsService (15 tests, all passing)
- Add integration tests for authorization (require Docker)
- Authorization coverage test passes

Authorization pattern:
- Org-level: enforce_any_source_admin
- Codebase-level: enforce_asset_action(action_key='asset.manage')

S3 structure: analytics/{codebase_id}/*.json
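The role-based filtering described above can be sketched as a pure function (the `Codebase` model and `administered_ids` parameter are illustrative stand-ins, not the actual service API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codebase:
    # Illustrative model; the real service uses its own ORM/schema types.
    id: str
    name: str


def filter_codebases(all_codebases, administered_ids, is_super_admin):
    """Super Admins see all codebases; Source Admins see only the
    codebases they administer (the filtering rule described in this PR)."""
    if is_super_admin:
        return list(all_codebases)
    return [cb for cb in all_codebases if cb.id in administered_ids]
```

For example, a Source Admin administering only codebase "a" gets a one-element list back, while a Super Admin sees everything regardless of `administered_ids`.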
…Overview schemas

Add 4 new fields for frontend Analytics table:
- total_contributors: int
- total_branches: int
- primary_language: str | None
- total_churn: int

These fields match the updated GitStats export schemas.
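As a rough sketch, the extended overview record with the four new fields might look like this (a dataclass stand-in; the backend presumably uses Pydantic response schemas, and the pre-existing fields are omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalyticsOverview:
    # The four new fields for the frontend Analytics table:
    total_contributors: int
    total_branches: int
    primary_language: Optional[str]  # None when no language could be detected
    total_churn: int  # additions + deletions over the repository history
```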
Migrate GitStats analytics functionality into python-backend as a new
content service with modular pipeline architecture.

New analytics content service (content_services/analytics/):
- Pipeline orchestrator with 6 phases (clone, extract, branches, store, aggregate, export)
- Hot storage (DuckDB) for aggregated metrics
- Warm/cold storage (Parquet) for commit and file data
- Dual SLOC calculator (line-based + byte-based metrics)
- JSON exporter for API consumption
- Comprehensive test suite (26 tests, all passing)

Hatchet workflow:
- analytics_workflow.py for triggering pipeline on codebases
- 2-hour timeout, concurrency limits (3 max)

Key design decisions:
- Use pygit2 directly for commit extraction
- Modular phase functions for testability
- TDD approach with fixtures from GitStats

Test coverage:
- Unit tests: SLOC calculator, hot storage, parquet storage
- Integration tests: Pipeline phases, end-to-end flow
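The six-phase orchestration above can be sketched as a simple fold over phase functions (phase names come from the commit message; the context-dict signature is an assumption, not the actual orchestrator code):

```python
def run_pipeline(ctx, phases):
    """Run each (name, fn) phase in order, threading a context dict
    through and recording which phases completed."""
    for name, fn in phases:
        ctx = fn(ctx)
        ctx.setdefault("completed", []).append(name)
    return ctx


PHASES = [
    ("clone", lambda ctx: ctx),      # clone the repository
    ("extract", lambda ctx: ctx),    # extract commits via pygit2
    ("branches", lambda ctx: ctx),   # collect branch metadata
    ("store", lambda ctx: ctx),      # write Parquet (warm/cold storage)
    ("aggregate", lambda ctx: ctx),  # aggregate into DuckDB (hot storage)
    ("export", lambda ctx: ctx),     # export JSON for the API
]
```

Keeping each phase a standalone function is what makes the unit testing described below practical: a phase can be exercised with a fabricated context instead of a full repository clone.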

The previous implementation used diff.patch_from_delta(), which doesn't
exist in pygit2. The extractor now uses diff.stats for reliable line
counts and parses diff.patch for byte calculations.

E2E verified on:
- sindresorhus/is: 244 commits, 11K additions, 6K deletions
- request/request: 2288 commits, 95K additions, 73K deletions
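A hedged sketch of the byte-counting half of this fix: pygit2's `Diff.stats` already provides reliable `insertions`/`deletions` line counts, while byte totals can be derived by scanning the unified-diff text from `diff.patch`. The helper below is illustrative, not the actual extractor code:

```python
def byte_counts_from_patch(patch_text: str) -> tuple[int, int]:
    """Sum the UTF-8 byte length of added and removed lines in a
    unified diff, skipping the '+++'/'---' file headers."""
    added = removed = 0
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added += len(line[1:].encode("utf-8"))
        elif line.startswith("-") and not line.startswith("---"):
            removed += len(line[1:].encode("utf-8"))
    return added, removed
```

Unlike the non-existent `patch_from_delta()` call, this degrades loudly: an empty or malformed patch simply yields (0, 0) rather than raising deep inside extraction.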
Add test_extract_commits_sloc_metrics to verify that commit extraction
correctly populates line and byte metrics from git diffs.

This test would have caught the bug where diff.patch_from_delta()
(non-existent in pygit2) was silently failing and returning 0 for all
SLOC metrics.

The test verifies:
- additions_lines > 0 for commits with file changes
- addition_bytes > 0 when additions_lines > 0
- churn_lines >= additions_lines (consistency check)
- files_changed >= 1 for commits with additions
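The invariants above amount to a small consistency check; sketched here over a plain dict (the real test presumably operates on the extractor's commit records, whose field names are taken from this commit message):

```python
def check_sloc_invariants(commit: dict) -> None:
    """Assert the SLOC consistency rules listed in the commit message."""
    if commit["additions_lines"] > 0:
        # Bytes must track lines: a non-empty added line has non-zero bytes.
        assert commit["addition_bytes"] > 0
        assert commit["files_changed"] >= 1
    # Churn counts additions plus deletions, so it can never be smaller
    # than additions alone.
    assert commit["churn_lines"] >= commit["additions_lines"]
```

The `addition_bytes > 0` check is exactly what would have tripped on the old silent-zero behaviour.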
Removed unused code identified during review:

Models (records.py):
- RepositoryMetadata class
- FileChange class
- IngestionSummary class

Imports:
- Unused 'Any' from typing (5 files)
- Unused 'branch_info_to_dict' import (orchestrator.py)

Storage (parquet_storage.py):
- write_branch_snapshots() method
- read_branch_snapshots() method

Schemas (warm_schemas.py):
- BRANCH_SNAPSHOTS_SCHEMA (only used by removed methods)

Exceptions (calculator.py):
- CalculationError base class (only subclasses were used)

All 27 tests passing.

Moved analytics from a separate Poetry package into the main package:
- content_services/analytics/src/* → content_services/src/analytics/
- content_services/analytics/tests/* → content_services/tests/analytics/

Changes:
- Updated all imports from 'src.' to 'analytics.'
- Removed sys.path hack from analytics_workflow.py
- Removed duplicate COPY in Dockerfile
- Deleted separate pyproject.toml (deps already in main package)
- Added path setup to tests/analytics/conftest.py

Benefits:
- Consistent with other services (inspector, autodocs)
- Simpler import structure
- Single test command covers everything
- No duplicate dependencies

All 27 tests passing.

The pipeline was generating JSON files locally but not uploading them to S3.
This caused 404 errors when the API tried to fetch analytics data.

Changes:
- Import AWSS3Client and org_id_to_hash from shared package
- Add _phase_upload() method to upload JSON files to S3
- Call Phase 7 after JSON export completes
- Upload path: analytics/{codebase_id}/*.json
- Uses organization-specific bucket via org_id_to_hash()
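The upload path scheme can be sketched as a pure key-building helper (the actual transfer goes through the shared AWSS3Client; `build_upload_keys` is a hypothetical name used here for illustration):

```python
def build_upload_keys(codebase_id: str, filenames) -> dict:
    """Map each exported JSON filename to its S3 key under the
    analytics/{codebase_id}/ prefix described in this commit."""
    return {
        name: f"analytics/{codebase_id}/{name}"
        for name in sorted(filenames)
        if name.endswith(".json")  # only the exported JSON is uploaded
    }
```

Computing the keys separately from performing the upload keeps the path convention trivially unit-testable without an S3 client.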
Two fixes for the analytics pipeline:

1. Display Name Fix:
   - Pass repo_owner and repo_name from pipeline input to aggregation engine
   - AggregationEngine now stores full_name as 'owner/repo' instead of a placeholder
   - Exported JSON shows actual repository name (e.g., 'lodash/lodash')

2. Org-Level File Updates (Phase 8):
   - After per-codebase JSON upload, update org-level summary files
   - _update_codebases_list(): Updates analytics/codebases_list.json
   - _update_org_summary(): Updates analytics/org_summary.json
   - New codebases now appear in the main analytics list automatically
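The org-level refresh is a read-modify-write: fetch the list JSON, replace or append this codebase's entry, write it back. A minimal sketch of the merge step (field names assumed):

```python
def upsert_codebase_entry(entries: list, new_entry: dict) -> list:
    """Return the codebases list with new_entry replacing any existing
    entry carrying the same codebase_id, appending it otherwise."""
    cid = new_entry["codebase_id"]
    merged = [e for e in entries if e["codebase_id"] != cid]
    merged.append(new_entry)
    return merged
```

Note this pattern is last-writer-wins: as the review comment on this PR points out, concurrent pipelines performing the same read-modify-write on analytics/codebases_list.json can lose updates without some form of locking or versioning.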
@driver-ai-adam-org

PR Summary

🔴 Feature | Risk: High | ⏱️ 60+ min | Tests: 🟡 partial

Adds a full GitStats analytics pipeline inside content_services and exposes it via new FastAPI endpoints that serve precomputed JSON from S3 with role-based filtering. Includes schema alignment for the frontend, S3 upload + org-level summary refresh, a pygit2 diff-counting fix, and expanded unit/integration test coverage.

Risk Assessment

🔴 High (5 items)

The PR adds new analytics API endpoints and an S3-backed service that returns repository metrics and code-ownership data (including contributor emails). The biggest risks are (1) potential cross-organization data exposure due to missing org scoping when enriching provider metadata from the database, and (2) reliability/DoS risk from unbounded S3 object reads and potentially huge JSON responses. There is also a breaking-change risk around new top-level routes and a data-privacy concern around exposing contributor PII.

⚠️ Security concerns identified - review carefully

⚠️ Breaking changes detected - check migration requirements

Critical Files

  • content_services/src/analytics/aggregation/engine.py - Core aggregation logic (deduping commits, cumulative metrics, branch metrics) and correctness directly impacts all exported analytics numbers and downstream UI.
  • content_services/src/analytics/export/exporter.py - Defines the contract to the API by producing JSON files; contains key metric computations (net_sloc/total_lines/ownership joins) and potential schema/field mismatches.
  • backend/app/services/analytics_service.py - Controls S3 reads, error handling, and authorization-based filtering; any bug can leak data across admins or break analytics access.
  • backend/app/api/routes/v1/analytics.py - Enforces endpoint-level authorization, 404 behavior, and response schema mapping; must match the intended security model and frontend expectations.
  • content_services/Dockerfile - Runtime dependency for pygit2 (libgit2-dev); misconfigurations will break pipeline execution in deployment.

Questions for Review

  • Do the exported JSON schemas in content_services/src/analytics/export/schemas.py and the API schemas in backend/app/schemas/analytics_schema.py match exactly (including field names like addition_bytes vs total_addition_bytes, and datetime formats/timezones)?
  • Is the authorization model correct for intended product behavior (org-level: any source admin; codebase-level: asset.manage only)? Should some analytics be visible to asset_members/readers?
  • How are org-level files (analytics/org_summary.json, analytics/codebases_list.json) updated safely under concurrency (multiple pipelines running) to avoid lost updates or partial writes? Any locking/versioning?

Stats

📁 20 files | +5383 | -0


View full analysis in Driver | Generated by Driver

@adamtilton adamtilton closed this Feb 3, 2026
@adamtilton adamtilton deleted the test/feature branch February 3, 2026 03:29
