- Add 7 API endpoints for GitStats analytics integration:
- Org-level: /analytics/summary, /analytics/codebases
- Codebase-level: /codebases/{id}/analytics/overview|branches|activity|ownership|status
- Add AnalyticsService for reading pre-computed JSON from S3
- Super Admins see all codebases
- Source Admins see only administered codebases (filtered)
- Add response schemas matching GitStats export schemas
- Add unit tests for AnalyticsService (15 tests, all passing)
- Add integration tests for authorization (require Docker)
- Authorization coverage test passes
Authorization pattern:
- Org-level: enforce_any_source_admin
- Codebase-level: enforce_asset_action(action_key='asset.manage')
S3 structure: analytics/{codebase_id}/*.json
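A minimal sketch of how the visibility rule and the S3 layout above could fit together. The helper names `analytics_key` and `visible_codebases`, and the `administered_ids` parameter, are hypothetical and do not come from the PR. The actual service enforces access through enforce_any_source_admin and enforce_asset_action.

```python
from typing import Iterable


def analytics_key(codebase_id: str, name: str) -> str:
    """Build an S3 key following the analytics/{codebase_id}/*.json layout."""
    return f"analytics/{codebase_id}/{name}.json"


def visible_codebases(
    all_ids: Iterable[str],
    is_super_admin: bool,
    administered_ids: set[str],
) -> list[str]:
    """Return the codebase ids a caller may see in org-level listings.

    Super Admins see everything; Source Admins see only the codebases
    they administer (hypothetical inputs, for illustration only).
    """
    if is_super_admin:
        return list(all_ids)
    return [cid for cid in all_ids if cid in administered_ids]
```

Keeping the filter as a pure function makes the two roles easy to unit test without a database or S3.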
…Overview schemas
Add 4 new fields for frontend Analytics table:
- total_contributors: int
- total_branches: int
- primary_language: str | None
- total_churn: int
These fields match the updated GitStats export schemas.
Migrate GitStats analytics functionality into python-backend as a new content service with modular pipeline architecture.
New analytics content service (content_services/analytics/):
- Pipeline orchestrator with 6 phases (clone, extract, branches, store, aggregate, export)
- Hot storage (DuckDB) for aggregated metrics
- Warm/cold storage (Parquet) for commit and file data
- Dual SLOC calculator (line-based + byte-based metrics)
- JSON exporter for API consumption
- Comprehensive test suite (26 tests, all passing)
Hatchet workflow:
- analytics_workflow.py for triggering pipeline on codebases
- 2-hour timeout, concurrency limits (3 max)
Key design decisions:
- Use pygit2 directly for commit extraction
- Modular phase functions for testability
- TDD approach with fixtures from GitStats
Test coverage:
- Unit tests: SLOC calculator, hot storage, parquet storage
- Integration tests: Pipeline phases, end-to-end flow
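The "modular phase functions" decision can be illustrated with a small sketch. The phase names come from the commit message; `run_pipeline`, the dict-based context, and the `completed` key are assumptions made for illustration and are not the orchestrator's real interface.

```python
from typing import Callable

PipelineContext = dict
Phase = Callable[[PipelineContext], PipelineContext]

# Phase names as listed in the commit message, in execution order.
PHASE_ORDER = ("clone", "extract", "branches", "store", "aggregate", "export")


def run_pipeline(phases: dict[str, Phase], context: PipelineContext) -> PipelineContext:
    """Run each phase in PHASE_ORDER and record which phases completed."""
    completed = []
    for name in PHASE_ORDER:
        # Each phase receives the context and returns an updated one.
        context = phases[name](context)
        completed.append(name)
    return {**context, "completed": completed}
```

Because each phase is an ordinary function, a test can swap in stubs for slow steps such as cloning and exercise one phase at a time.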
The previous implementation used diff.patch_from_delta(), which doesn't exist in pygit2. Now uses diff.stats for reliable line counts and parses diff.patch for byte calculations.
E2E verified on:
- sindresorhus/is: 244 commits, 11K additions, 6K deletions
- request/request: 2288 commits, 95K additions, 73K deletions
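A sketch of the byte-counting half of this fix, assuming the patch text is a standard unified diff such as the string pygit2 exposes as diff.patch. The function name `count_patch_bytes` is hypothetical, and the PR's actual parser may differ. Line counts would come from diff.stats (insertions, deletions, files_changed) and are not recomputed here.

```python
def count_patch_bytes(patch_text: str) -> tuple[int, int]:
    """Return (added_bytes, deleted_bytes) for unified diff text.

    Bytes are counted per changed line, excluding the leading +/- marker
    and the trailing newline.
    """
    added = 0
    deleted = 0
    for line in patch_text.splitlines():
        # Skip file headers such as "+++ b/file" and "--- a/file".
        # Limitation: a removed line whose content starts with "-- "
        # looks identical to a header and is skipped by this heuristic.
        if line.startswith("+++ ") or line.startswith("--- "):
            continue
        if line.startswith("+"):
            added += len(line[1:].encode("utf-8"))
        elif line.startswith("-"):
            deleted += len(line[1:].encode("utf-8"))
    return added, deleted
```

For example, a hunk that replaces the line `old` with `newer` yields 5 added bytes and 3 deleted bytes.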
Add test_extract_commits_sloc_metrics to verify that commit extraction correctly populates line and byte metrics from git diffs.
This test would have caught the bug where diff.patch_from_delta() (non-existent in pygit2) was silently failing and returning 0 for all SLOC metrics.
The test verifies:
- additions_lines > 0 for commits with file changes
- addition_bytes > 0 when additions_lines > 0
- churn_lines >= additions_lines (consistency check)
- files_changed >= 1 for commits with additions
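The four checks can be written as a reusable helper. This is a sketch only: the field names come from the commit message, while the dict-shaped record and the function name `sloc_invariant_violations` are assumptions. The first check presumes a fixture in which every commit that changes files also adds lines, so it would not hold for deletion-only commits.

```python
def sloc_invariant_violations(commit: dict) -> list[str]:
    """Return a description of each SLOC consistency check the record fails."""
    violations = []
    # Fixture assumption: commits that touch files also add lines.
    if commit["files_changed"] >= 1 and commit["additions_lines"] <= 0:
        violations.append("additions_lines must be > 0 for commits with file changes")
    if commit["additions_lines"] > 0 and commit["addition_bytes"] <= 0:
        violations.append("addition_bytes must be > 0 when additions_lines > 0")
    if commit["churn_lines"] < commit["additions_lines"]:
        violations.append("churn_lines must be >= additions_lines")
    if commit["additions_lines"] > 0 and commit["files_changed"] < 1:
        violations.append("files_changed must be >= 1 for commits with additions")
    return violations
```

A record with every metric at zero but files_changed set, which is the shape the silent failure produced, trips the first check.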
Removed unused code identified during review:
Models (records.py):
- RepositoryMetadata class
- FileChange class
- IngestionSummary class
Imports:
- Unused 'Any' from typing (5 files)
- Unused 'branch_info_to_dict' import (orchestrator.py)
Storage (parquet_storage.py):
- write_branch_snapshots() method
- read_branch_snapshots() method
Schemas (warm_schemas.py):
- BRANCH_SNAPSHOTS_SCHEMA (only used by removed methods)
Exceptions (calculator.py):
- CalculationError base class (only subclasses were used)
All 27 tests passing.
Moved analytics from separate Poetry package to main package:
- content_services/analytics/src/* → content_services/src/analytics/
- content_services/analytics/tests/* → content_services/tests/analytics/
Changes:
- Updated all imports from 'src.' to 'analytics.'
- Removed sys.path hack from analytics_workflow.py
- Removed duplicate COPY in Dockerfile
- Deleted separate pyproject.toml (deps already in main package)
- Added path setup to tests/analytics/conftest.py
Benefits:
- Consistent with other services (inspector, autodocs)
- Simpler import structure
- Single test command covers everything
- No duplicate dependencies
All 27 tests passing.
The pipeline was generating JSON files locally but not uploading them to S3.
This caused 404 errors when the API tried to fetch analytics data.
Changes:
- Import AWSS3Client and org_id_to_hash from shared package
- Add _phase_upload() method to upload JSON files to S3
- Call Phase 7 after JSON export completes
- Upload path: analytics/{codebase_id}/{filename}.json
- Uses organization-specific bucket via org_id_to_hash()
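A sketch of what the upload phase might look like, under stated assumptions: `upload_exported_json` and the `put_object` callable are hypothetical stand-ins, and the real _phase_upload() uses AWSS3Client with a bucket derived from org_id_to_hash(), whose signatures are not shown in this PR. Only the key layout is taken from the commit message.

```python
from pathlib import Path
from typing import Callable


def upload_exported_json(
    export_dir: Path,
    codebase_id: str,
    put_object: Callable[[str, bytes], None],
) -> list[str]:
    """Upload every *.json file in export_dir and return the keys written.

    put_object(key, body) is an injected uploader (hypothetical), so the
    key-building logic can be tested without touching S3.
    """
    keys = []
    for path in sorted(export_dir.glob("*.json")):
        # Matches the documented layout: analytics/{codebase_id}/{filename}.json
        key = f"analytics/{codebase_id}/{path.name}"
        put_object(key, path.read_bytes())
        keys.append(key)
    return keys
```

Injecting the uploader keeps the phase testable with an in-memory dict while production code passes a function that writes to the organization's bucket.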
Two fixes for the analytics pipeline:
1. Display Name Fix:
   - Pass repo_owner and repo_name from pipeline input to aggregation engine
   - AggregationEngine now stores full_name as 'owner/repo' instead of placeholder
   - Exported JSON shows actual repository name (e.g., 'lodash/lodash')
2. Org-Level File Updates (Phase 8):
   - After per-codebase JSON upload, update org-level summary files
   - _update_codebases_list(): Updates analytics/codebases_list.json
   - _update_org_summary(): Updates analytics/org_summary.json
   - New codebases now appear in the main analytics list automatically
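The org-level list update amounts to an upsert: replace the entry for a codebase that is already listed, otherwise append it. The sketch below assumes each entry is a dict with an `id` key (hypothetical) and the full_name field mentioned above. The real shape of codebases_list.json is not shown in this PR, and `upsert_codebase` is an illustrative name.

```python
def upsert_codebase(entries: list[dict], entry: dict) -> list[dict]:
    """Return a new list in which entry replaces any item with the same id."""
    # Drop any stale entry for this codebase, then add the fresh one.
    kept = [e for e in entries if e.get("id") != entry["id"]]
    kept.append(entry)
    # Sort by display name so the list is stable across pipeline runs.
    return sorted(kept, key=lambda e: e.get("full_name", ""))
```

Returning a new list rather than mutating the input makes a read, modify, write cycle against S3 easier to reason about, although concurrent pipeline runs would still need some form of coordination.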
PR Summary
🔴 Feature | Risk: High | ⏱️ 60+ min | Tests: 🟡 partial
Adds a full GitStats analytics pipeline inside content_services and exposes it via new FastAPI endpoints that serve precomputed JSON from S3 with role-based filtering. Includes schema alignment for the frontend, S3 upload + org-level summary refresh, a pygit2 diff-counting fix, and expanded unit/integration test coverage.
Risk Assessment
🔴 High (5 items)
The PR adds new analytics API endpoints and an S3-backed service that returns repository metrics and code-ownership data (including contributor emails). The biggest risks are (1) potential cross-organization data exposure due to missing org scoping when enriching provider metadata from the database, and (2) reliability/DoS risk from unbounded S3 object reads and potentially huge JSON responses. There is also a breaking-change risk around new top-level routes and a data-privacy concern around exposing contributor PII.
Stats
📁 20 files | +5383 | -0
Generated by Driver