
GitStats Analytics Pipeline#32

Closed
adamtilton wants to merge 9 commits into develop from test/feature

Conversation

@adamtilton
Contributor

No description provided.

- Add 7 API endpoints for GitStats analytics integration:
  - Org-level: /analytics/summary, /analytics/codebases
  - Codebase-level: /codebases/{id}/analytics/overview|branches|activity|ownership|status
- Add AnalyticsService for reading pre-computed JSON from S3
  - Super Admins see all codebases
  - Source Admins see only administered codebases (filtered)
- Add response schemas matching GitStats export schemas
- Add unit tests for AnalyticsService (15 tests, all passing)
- Add integration tests for authorization (require Docker)
- Authorization coverage test passes

Authorization pattern:
- Org-level: enforce_any_source_admin
- Codebase-level: enforce_asset_action(action_key='asset.manage')

S3 structure: analytics/{codebase_id}/*.json
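The role-based filtering described above can be sketched as a pure function (the `Codebase` model and `administered_ids` parameter are illustrative stand-ins, not the actual service API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codebase:
    # Illustrative model; the real service uses its own ORM/schema types.
    id: str
    name: str


def filter_codebases(all_codebases, administered_ids, is_super_admin):
    """Super Admins see all codebases; Source Admins see only the
    codebases they administer (the filtering rule described in this PR)."""
    if is_super_admin:
        return list(all_codebases)
    return [cb for cb in all_codebases if cb.id in administered_ids]
```

For example, a Source Admin administering only codebase "a" gets a one-element list back, while a Super Admin sees everything regardless of `administered_ids`.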
…Overview schemas

Add 4 new fields for frontend Analytics table:
- total_contributors: int
- total_branches: int
- primary_language: str | None
- total_churn: int

These fields match the updated GitStats export schemas.
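As a rough sketch, the extended overview record with the four new fields might look like this (a dataclass stand-in; the backend presumably uses Pydantic response schemas, and the pre-existing fields are omitted):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalyticsOverview:
    # The four new fields for the frontend Analytics table:
    total_contributors: int
    total_branches: int
    primary_language: Optional[str]  # None when no language could be detected
    total_churn: int  # additions + deletions over the repository history
```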
Migrate GitStats analytics functionality into python-backend as a new
content service with modular pipeline architecture.

New analytics content service (content_services/analytics/):
- Pipeline orchestrator with 6 phases (clone, extract, branches, store, aggregate, export)
- Hot storage (DuckDB) for aggregated metrics
- Warm/cold storage (Parquet) for commit and file data
- Dual SLOC calculator (line-based + byte-based metrics)
- JSON exporter for API consumption
- Comprehensive test suite (26 tests, all passing)

Hatchet workflow:
- analytics_workflow.py for triggering pipeline on codebases
- 2-hour timeout, concurrency limits (3 max)

Key design decisions:
- Use pygit2 directly for commit extraction
- Modular phase functions for testability
- TDD approach with fixtures from GitStats

Test coverage:
- Unit tests: SLOC calculator, hot storage, parquet storage
- Integration tests: Pipeline phases, end-to-end flow
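The six-phase orchestration above can be sketched as a simple fold over phase functions (phase names come from the commit message; the context-dict signature is an assumption, not the actual orchestrator code):

```python
def run_pipeline(ctx, phases):
    """Run each (name, fn) phase in order, threading a context dict
    through and recording which phases completed."""
    for name, fn in phases:
        ctx = fn(ctx)
        ctx.setdefault("completed", []).append(name)
    return ctx


PHASES = [
    ("clone", lambda ctx: ctx),      # clone the repository
    ("extract", lambda ctx: ctx),    # extract commits via pygit2
    ("branches", lambda ctx: ctx),   # collect branch metadata
    ("store", lambda ctx: ctx),      # write Parquet (warm/cold storage)
    ("aggregate", lambda ctx: ctx),  # aggregate into DuckDB (hot storage)
    ("export", lambda ctx: ctx),     # export JSON for the API
]
```

Keeping each phase a standalone function is what makes the unit testing described below practical: a phase can be exercised with a fabricated context instead of a full repository clone.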

The previous implementation used diff.patch_from_delta(), which doesn't
exist in pygit2. The extractor now uses diff.stats for reliable line
counts and parses diff.patch for byte calculations.

E2E verified on:
- sindresorhus/is: 244 commits, 11K additions, 6K deletions
- request/request: 2288 commits, 95K additions, 73K deletions
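A hedged sketch of the byte-counting half of this fix: pygit2's `Diff.stats` already provides reliable `insertions`/`deletions` line counts, while byte totals can be derived by scanning the unified-diff text from `diff.patch`. The helper below is illustrative, not the actual extractor code:

```python
def byte_counts_from_patch(patch_text: str) -> tuple[int, int]:
    """Sum the UTF-8 byte length of added and removed lines in a
    unified diff, skipping the '+++'/'---' file headers."""
    added = removed = 0
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added += len(line[1:].encode("utf-8"))
        elif line.startswith("-") and not line.startswith("---"):
            removed += len(line[1:].encode("utf-8"))
    return added, removed
```

Unlike the non-existent `patch_from_delta()` call, this degrades loudly: an empty or malformed patch simply yields (0, 0) rather than raising deep inside extraction.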
Add test_extract_commits_sloc_metrics to verify that commit extraction
correctly populates line and byte metrics from git diffs.

This test would have caught the bug where diff.patch_from_delta()
(non-existent in pygit2) was silently failing and returning 0 for all
SLOC metrics.

The test verifies:
- additions_lines > 0 for commits with file changes
- addition_bytes > 0 when additions_lines > 0
- churn_lines >= additions_lines (consistency check)
- files_changed >= 1 for commits with additions
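The invariants above amount to a small consistency check; sketched here over a plain dict (the real test presumably operates on the extractor's commit records, whose field names are taken from this commit message):

```python
def check_sloc_invariants(commit: dict) -> None:
    """Assert the SLOC consistency rules listed in the commit message."""
    if commit["additions_lines"] > 0:
        # Bytes must track lines: a non-empty added line has non-zero bytes.
        assert commit["addition_bytes"] > 0
        assert commit["files_changed"] >= 1
    # Churn counts additions plus deletions, so it can never be smaller
    # than additions alone.
    assert commit["churn_lines"] >= commit["additions_lines"]
```

The `addition_bytes > 0` check is exactly what would have tripped on the old silent-zero behaviour.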
Removed unused code identified during review:

Models (records.py):
- RepositoryMetadata class
- FileChange class
- IngestionSummary class

Imports:
- Unused 'Any' from typing (5 files)
- Unused 'branch_info_to_dict' import (orchestrator.py)

Storage (parquet_storage.py):
- write_branch_snapshots() method
- read_branch_snapshots() method

Schemas (warm_schemas.py):
- BRANCH_SNAPSHOTS_SCHEMA (only used by removed methods)

Exceptions (calculator.py):
- CalculationError base class (only subclasses were used)

All 27 tests passing.

Moved analytics from a separate Poetry package into the main package:
- content_services/analytics/src/* → content_services/src/analytics/
- content_services/analytics/tests/* → content_services/tests/analytics/

Changes:
- Updated all imports from 'src.' to 'analytics.'
- Removed sys.path hack from analytics_workflow.py
- Removed duplicate COPY in Dockerfile
- Deleted separate pyproject.toml (deps already in main package)
- Added path setup to tests/analytics/conftest.py

Benefits:
- Consistent with other services (inspector, autodocs)
- Simpler import structure
- Single test command covers everything
- No duplicate dependencies

All 27 tests passing.

The pipeline was generating JSON files locally but not uploading them to S3.
This caused 404 errors when the API tried to fetch analytics data.

Changes:
- Import AWSS3Client and org_id_to_hash from shared package
- Add _phase_upload() method to upload JSON files to S3
- Call Phase 7 after JSON export completes
- Upload path: analytics/{codebase_id}/*.json
- Uses organization-specific bucket via org_id_to_hash()
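The upload path scheme can be sketched as a pure key-building helper (the actual transfer goes through the shared AWSS3Client; `build_upload_keys` is a hypothetical name used here for illustration):

```python
def build_upload_keys(codebase_id: str, filenames) -> dict:
    """Map each exported JSON filename to its S3 key under the
    analytics/{codebase_id}/ prefix described in this commit."""
    return {
        name: f"analytics/{codebase_id}/{name}"
        for name in sorted(filenames)
        if name.endswith(".json")  # only the exported JSON is uploaded
    }
```

Computing the keys separately from performing the upload keeps the path convention trivially unit-testable without an S3 client.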
Two fixes for the analytics pipeline:

1. Display Name Fix:
   - Pass repo_owner and repo_name from pipeline input to aggregation engine
   - AggregationEngine now stores full_name as 'owner/repo' instead of a placeholder
   - Exported JSON shows actual repository name (e.g., 'lodash/lodash')

2. Org-Level File Updates (Phase 8):
   - After per-codebase JSON upload, update org-level summary files
   - _update_codebases_list(): Updates analytics/codebases_list.json
   - _update_org_summary(): Updates analytics/org_summary.json
   - New codebases now appear in the main analytics list automatically
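The org-level refresh is a read-modify-write: fetch the list JSON, replace or append this codebase's entry, write it back. A minimal sketch of the merge step (field names assumed):

```python
def upsert_codebase_entry(entries: list, new_entry: dict) -> list:
    """Return the codebases list with new_entry replacing any existing
    entry carrying the same codebase_id, appending it otherwise."""
    cid = new_entry["codebase_id"]
    merged = [e for e in entries if e["codebase_id"] != cid]
    merged.append(new_entry)
    return merged
```

Note this pattern is last-writer-wins: as the review comment on this PR points out, concurrent pipelines performing the same read-modify-write on analytics/codebases_list.json can lose updates without some form of locking or versioning.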
@driver-ai-adam-org

PR Summary

🔴 Feature | Risk: High | ⏱️ 60+ min | Tests: 🟡 partial

Adds a full GitStats analytics pipeline inside content_services and exposes it via new FastAPI endpoints that serve precomputed JSON from S3 with role-based filtering. Includes schema alignment for the frontend, S3 upload + org-level summary refresh, a pygit2 diff-counting fix, and expanded unit/integration test coverage.

Risk Assessment

🔴 High (5 items)

The PR adds new analytics API endpoints and an S3-backed service that returns repository metrics and code-ownership data (including contributor emails). The biggest risks are (1) potential cross-organization data exposure due to missing org scoping when enriching provider metadata from the database, and (2) reliability/DoS risk from unbounded S3 object reads and potentially huge JSON responses. There is also a breaking-change risk around new top-level routes and a data-privacy concern around exposing contributor PII.

⚠️ Security concerns identified - review carefully

⚠️ Breaking changes detected - check migration requirements

Critical Files

  • content_services/src/analytics/aggregation/engine.py - Core aggregation logic (deduping commits, cumulative metrics, branch metrics) and correctness directly impacts all exported analytics numbers and downstream UI.
  • content_services/src/analytics/export/exporter.py - Defines the contract to the API by producing JSON files; contains key metric computations (net_sloc/total_lines/ownership joins) and potential schema/field mismatches.
  • backend/app/services/analytics_service.py - Controls S3 reads, error handling, and authorization-based filtering; any bug can leak data across admins or break analytics access.
  • backend/app/api/routes/v1/analytics.py - Enforces endpoint-level authorization, 404 behavior, and response schema mapping; must match the intended security model and frontend expectations.
  • content_services/Dockerfile - Runtime dependency for pygit2 (libgit2-dev); misconfigurations will break pipeline execution in deployment.

Questions for Review

  • Do the exported JSON schemas in content_services/src/analytics/export/schemas.py and the API schemas in backend/app/schemas/analytics_schema.py match exactly (including field names like addition_bytes vs total_addition_bytes, and datetime formats/timezones)?
  • Is the authorization model correct for intended product behavior (org-level: any source admin; codebase-level: asset.manage only)? Should some analytics be visible to asset_members/readers?
  • How are org-level files (analytics/org_summary.json, analytics/codebases_list.json) updated safely under concurrency (multiple pipelines running) to avoid lost updates or partial writes? Any locking/versioning?

Stats

📁 20 files | +5383 | -0


View full analysis in Driver | Generated by Driver

@adamtilton adamtilton closed this Feb 3, 2026
@adamtilton adamtilton deleted the test/feature branch February 3, 2026 03:29
