
feat(data-warehouse): add ClickHouse source #53601

Merged
danielcarletti merged 21 commits into master from dc-feature-clickhouse-source
Apr 20, 2026

Conversation

@danielcarletti
Contributor

@danielcarletti danielcarletti commented Apr 7, 2026

Problem

PostHog's data warehouse supports a long list of sources but not ClickHouse itself. Users running their own ClickHouse deployments want to pull that data into PostHog without building a custom pipeline, and the target databases are often multi-terabyte and multi-billion row — the import path has to stream, not buffer.

Changes

Adds a new warehouse source for ClickHouse under posthog/temporal/data_imports/sources/clickhouse/, following the same split as the Postgres source (source.py for registration/form fields/validation, clickhouse.py for transport). Designed up front to scale to very large databases.

Scalability:

  • Uses clickhouse-connect with query_arrow_stream, so data flows as a sequence of pa.RecordBatch chunks sized by ClickHouse's max_block_size. Batches are accumulated into ~100k-row / 200 MiB pa.Tables via pa.Table.from_batches before yielding, so Delta sees fewer, larger commits. Memory per worker is bounded regardless of table size.
  • Row counts come for free from system.tables.total_rows for MergeTree tables — no SELECT COUNT(*) on multi-billion row tables. Distributed tables fall back to SELECT count() (cheap, distributed). MaterializedViews resolve to their TO target's total_rows or their .inner_id.<uuid> inner table. Plain views and no-counter engines (Memory/Buffer/Log/Kafka/URL) are reported as "Skipped" with an explanatory tooltip in the UI.
  • Partition sizing is computed from system.tables.total_bytes / total_rows rather than sampling the table, targeting DEFAULT_PARTITION_TARGET_SIZE_IN_BYTES (200 MiB) per partition.
  • Server-side settings: output_format_arrow_string_as_string=1, output_format_arrow_low_cardinality_as_dictionary=0, optimize_read_in_order=1, max_bytes_before_external_sort=500 MiB, max_execution_time, tunable max_block_size.
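
The batch-accumulation policy above can be sketched as follows. This is an illustrative reconstruction, not the PR's code: the `Batch` class is a stand-in exposing the `num_rows`/`nbytes` attributes that `pa.RecordBatch` provides, and the thresholds mirror the numbers quoted above.

```python
from dataclasses import dataclass

TARGET_ROWS = 100_000
TARGET_BYTES = 200 * 1024 * 1024  # 200 MiB, per the description above

@dataclass
class Batch:
    """Stand-in for pa.RecordBatch: just the two attributes the policy reads."""
    num_rows: int
    nbytes: int

def accumulate(batches):
    """Buffer streamed chunks until ~100k rows or ~200 MiB, then flush one group.

    In the real source each flushed group would become a single
    pa.Table.from_batches(buffer), so Delta sees fewer, larger commits
    while per-worker memory stays bounded by the thresholds.
    """
    buffer, rows, nbytes = [], 0, 0
    for batch in batches:
        buffer.append(batch)
        rows += batch.num_rows
        nbytes += batch.nbytes
        if rows >= TARGET_ROWS or nbytes >= TARGET_BYTES:
            yield buffer
            buffer, rows, nbytes = [], 0, 0
    if buffer:
        yield buffer  # flush the tail regardless of size
```

Whichever threshold trips first wins, so a stream of many small blocks and a stream of a few huge blocks both flush at roughly the same commit size.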

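The row-count selection above reads as a small decision table. A hedged sketch (the engine names are real ClickHouse `system.tables.engine` values; the function name and the returned strategy labels are ours, not the PR's identifiers):

```python
# Engines with no cheap counter; counting them would require a full scan.
NO_COUNTER_ENGINES = {"Memory", "Buffer", "Log", "TinyLog", "StripeLog", "Kafka", "URL", "View"}

def row_count_strategy(engine: str, total_rows) -> str:
    """Pick how to obtain a row count without scanning multi-billion-row tables."""
    if engine.endswith("MergeTree") and total_rows is not None:
        return "system.tables.total_rows"   # free: already maintained by the engine
    if engine == "Distributed":
        return "SELECT count()"             # cheap: pushed down to the shards
    if engine == "MaterializedView":
        return "resolve TO target or .inner_id table"
    if engine in NO_COUNTER_ENGINES:
        return "skipped"                    # surfaced as "Skipped" in the UI
    return "SELECT count()"
```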
Arrow compatibility:

  • ClickHouse refuses to emit several types as Arrow (error 50 on UUID, IPv4/6, wide ints, Enum*, FixedString, Array, Map, Tuple, Nested, Variant, Dynamic, JSON, Object). We now build an explicit SELECT list and wrap those columns in toString(col) AS col so the stream never crashes. Type mapping handles Nullable/LowCardinality wrappers, DateTime/DateTime64 with precision + timezone, Decimal[32-256], Date/Date32, signed/unsigned ints up to 64-bit (wide Int128/256 fall back to string), Enum8/16, and composites serialized to string.
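
A minimal sketch of the explicit SELECT-list construction described above. Helper names (`build_select_list`, `_strip_modifiers`) are ours; the set of string-wrapped types mirrors the list in this PR.

```python
import re

# Types ClickHouse refuses to emit as Arrow (error 50); prefix match, since
# most of these carry parameters like FixedString(16) or Array(String).
_STRING_WRAPPED = re.compile(
    r"^(UUID|IPv4|IPv6|Int128|Int256|UInt128|UInt256|Enum8|Enum16|"
    r"FixedString|Array|Map|Tuple|Nested|Variant|Dynamic|JSON|Object)"
)

def _strip_modifiers(ch_type: str) -> str:
    """Peel Nullable(...)/LowCardinality(...) wrappers to reach the inner type."""
    while True:
        m = re.match(r"^(Nullable|LowCardinality)\((.*)\)$", ch_type)
        if not m:
            return ch_type
        ch_type = m.group(2)

def build_select_list(columns: list[tuple[str, str]]) -> str:
    """Wrap Arrow-incompatible columns in toString(col) AS col; pass the rest through."""
    parts = []
    for name, ch_type in columns:
        quoted = "`" + name.replace("`", "``") + "`"
        if _STRING_WRAPPED.match(_strip_modifiers(ch_type)):
            parts.append(f"toString({quoted}) AS {quoted}")
        else:
            parts.append(quoted)
    return ", ".join(parts)
```

With an explicit list like this, `SELECT *` never reaches the server, so the Arrow stream cannot crash on an unconvertible column.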

Schema discovery:

  • Single round-trip to system.columns for the whole database.
  • Primary key comes from is_in_sorting_key on system.columns. Because ClickHouse's sorting key is not necessarily unique, every incremental sync runs _has_duplicate_primary_keys first (bounded-prefix probe with max_rows_to_read + read_overflow_mode='break').
  • View vs. materialized-view detection via system.tables.engine.
  • Discovery/query log lines run at info level so they surface on the syncs tab.
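
The bounded duplicate-key probe can be sketched as a query plus per-query settings. Helper names and the 10M-row budget here are illustrative; `max_rows_to_read` and `read_overflow_mode` are real ClickHouse query settings.

```python
ROW_BUDGET = 10_000_000  # illustrative; bounds the probe regardless of table size

def duplicate_probe_query(database: str, table: str, key_columns: list[str]) -> tuple[str, dict]:
    """Probe a bounded prefix of the table for duplicate sorting-key values.

    read_overflow_mode='break' makes ClickHouse stop at the budget instead
    of erroring, so the probe is O(budget), never a full-table GROUP BY.
    """
    cols = ", ".join("`" + c.replace("`", "``") + "`" for c in key_columns)
    query = (
        f"SELECT {cols} FROM `{database}`.`{table}` "
        f"GROUP BY {cols} HAVING count() > 1 LIMIT 1"
    )
    settings = {"max_rows_to_read": ROW_BUDGET, "read_overflow_mode": "break"}
    return query, settings
```

Any returned row means the key is not unique within the probed prefix; an empty result means no duplicates were found in the budget, not a guarantee of uniqueness.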

Incremental sync:

  • Supports integer (Int8-Int256, UInt8-UInt256) and temporal (Date, Date32, DateTime, DateTime64) cursor fields.
  • Query builder uses parameterized queries (%(last_value)s) — only validated, backtick-quoted identifiers land in the SQL string. Identifier quoting escapes embedded backticks and rejects null bytes.
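
A minimal sketch of the incremental query construction, with hypothetical helper names: only validated, backtick-quoted identifiers are interpolated into the SQL string, and the cursor value travels as a server-side `%(last_value)s` parameter.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote an identifier, escaping embedded backticks; reject null bytes."""
    if "\x00" in name:
        raise ValueError("null byte in identifier")
    return "`" + name.replace("`", "``") + "`"

def build_incremental_query(database: str, table: str, cursor_field: str) -> str:
    """Build the incremental extraction query; the cursor value stays a parameter."""
    qualified = f"{quote_identifier(database)}.{quote_identifier(table)}"
    field = quote_identifier(cursor_field)
    return (
        f"SELECT * FROM {qualified} "
        f"WHERE {field} > %(last_value)s ORDER BY {field} ASC"
    )
```

The caller would then pass `parameters={"last_value": ...}` to the client, so the typed cursor value is never string-formatted into SQL.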

Connection options: host, port, database, user, password (optional), HTTPS toggle, SSL-verify toggle, optional SSH tunnel. SSH tunnel works transparently because we use HTTP(S).

Registration / plumbing:

  • Added CLICKHOUSE to ExternalDataSourceType in products/data_warehouse/backend/types.py, posthog/schema.py (via schema:build), and frontend/src/queries/schema/schema-general.ts.
  • New Django migration 0045_alter_externaldatasource_source_type adds the choice to the model.
  • Registered ClickHouseSource in posthog/temporal/data_imports/sources/__init__.py.
  • ClickHouseSourceConfig regenerated via generate:source-configs.
  • Frontend: SchemaForm renders "Skipped" with a tooltip instead of "Unknown" when row count is unavailable, explaining that counting would require a full scan.

How did you test this code?

This PR was authored by an agent (Claude Code). Verification so far is code-level plus one round of manual smoke-testing against a local ClickHouse:

  • 141 unit tests in test_clickhouse.py, all passing, covering:
      ◦ identifier quoting and type modifier stripping (Nullable/LowCardinality)
      ◦ incremental-field filtering across every supported CH type
      ◦ query-builder output, including toString wrapping of Arrow-incompatible types
      ◦ ClickHouseColumn → pa.Field mapping for all supported types, including DateTime64 precision/timezone, Decimal[32-256], wide ints, enums, and composites
      ◦ non-retryable error pattern matching and error translation
      ◦ schema grouping (mocked client) and validate_credentials error paths
      ◦ batch-accumulation boundaries in get_rows
      ◦ MV target parsing (qualified/backticked/unqualified/none)
      ◦ get_clickhouse_row_count across MergeTree/Distributed/MV-with-TO/MV-inner/View paths
  • The existing Postgres test suite still passes, confirming no regression in shared types/configs.
  • Smoke-tested source loading under Django: class is registered, config class generates correctly, form fields render.
  • Manual testing against a local ClickHouse — table list renders, Distributed row counts populate via count() fallback, views render as "Skipped":

Follow-ups for merging:

  • Drop a clickhouse.png icon into frontend/public/services/ (currently references a placeholder path).
  • Write a docs page at posthog.com/docs/cdp/sources/clickhouse (URL referenced in docsUrl).

Publish to changelog?

Yes — new warehouse source.

🤖 LLM context

Authored by Claude Code (Opus 4.6, 1M context) across multiple sessions. The agent read the existing Postgres, Snowflake, and MySQL sources as reference, followed the implementing-warehouse-sources skill, and chose clickhouse-connect's query_arrow_stream over clickhouse-driver specifically for the Arrow streaming path (which bounds memory on huge tables). Later sessions hardened the source after discovering real-world failures: ClickHouse's Arrow output refuses several common types (error 50), query_arrow_stream yields RecordBatch not Table, and system.tables.total_rows is NULL for Distributed tables and MaterializedViews — all now handled.

Adds a data warehouse source for ClickHouse, built for scalability with
very large databases via clickhouse-connect's streaming Arrow reader,
free row/byte counts from system.tables, and sorting-key-based primary
key discovery. Supports HTTPS, SSL verification toggle, and SSH tunnel.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

Size Change: +91 B (0%)

Total Size: 129 MB

Filename Size Change
frontend/dist/exporter 20.9 MB +26 B (0%)
frontend/dist/exporter.js 20.9 MB +26 B (0%)
frontend/dist/render-query.js 20.6 MB +26 B (0%)
frontend/dist/toolbar.js 10.6 MB +13 B (0%)

compressed-size-action

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

Migration SQL Changes

Hey 👋, we've detected some migrations on this PR. Here's the SQL output for each migration; make sure it makes sense:

products/data_warehouse/backend/migrations/0045_alter_externaldatasource_source_type.py

BEGIN;
--
-- Alter field source_type on externaldatasource
--
-- (no-op)
COMMIT;

Last updated: 2026-04-20 17:59 UTC (6105f2a)

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

🔍 Migration Risk Analysis

We've analyzed your migrations for potential risks.

Summary: 0 Safe | 1 Needs Review | 0 Blocked

⚠️ Needs Review

May have performance impact

data_warehouse.0045_alter_externaldatasource_source_type
  └─ #1 ⚠️ AlterField
     Field alteration may cause table locks or data loss (check if changing type or constraints)
     model: externaldatasource, field: source_type, field_type: CharField

Last updated: 2026-04-20 17:59 UTC (6105f2a)

@tests-posthog
Contributor

tests-posthog Bot commented Apr 7, 2026

⏭️ Skipped snapshot commit because branch advanced to f3c73f7 while workflow was testing a4905b8.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

@github-actions
Contributor

github-actions Bot commented Apr 7, 2026

🎭 Playwright report · View test results →

⚠️ 3 flaky tests:

  • creates a Postgres direct source and queries it successfully (chromium)
  • Save view (chromium)
  • Materialize view pane (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@tests-posthog
Contributor

tests-posthog Bot commented Apr 13, 2026

Query snapshots: Backend query snapshots updated

Changes: 2 snapshots (2 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

- Bound the duplicate-primary-key probe to a 10M-row prefix with
  read_overflow_mode='break' so misconfiguration detection is O(budget)
  instead of a full-table GROUP BY on every incremental sync. Fail-safe
  flips to True on unexpected errors to block merges against unverifiable
  keys.
- Add optimize_read_in_order and max_bytes_before_external_sort to the
  data query. When the cursor leads the sorting key the top-level sort is
  skipped; otherwise we spill to disk instead of OOMing. Warn when the
  cursor isn't a sort-key prefix.
- Accumulate streamed Arrow blocks into ~200 MiB / ~100k-row pa.Tables
  before yielding, collapsing the Delta commit count by ~5x on large
  tables without raising peak memory meaningfully.
- Replace the full-table row count on incremental resumes with a bounded
  WHERE cursor > last_value count so progress reporting tracks actual
  work. Default rows_to_sync to None instead of 0 when unknown.
- Widen _get_client exception wrapping to cover OSError and ssl.SSLError
  alongside ClickHouseError.

Made-with: Cursor
…to dc-feature-clickhouse-source

Made-with: Cursor

# Conflicts:
#	posthog/api/test/__snapshots__/test_api_docs.ambr
@tests-posthog
Contributor

tests-posthog Bot commented Apr 16, 2026

⏭️ Skipped snapshot commit because branch advanced to a3c9787 while workflow was testing bc089a1.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

- password: str -> str | None across clickhouse.py signatures (matches
  ClickHouseSourceConfig), coerce to "" at the clickhouse-connect boundary
- pa.timestamp: branch on optional tz and tighten _datetime_unit_for_precision
  return to Literal so the overload resolves
- test: narrow response.items() away from AsyncIterable before list()

Made-with: Cursor
@danielcarletti danielcarletti marked this pull request as ready for review April 16, 2026 17:38
@danielcarletti danielcarletti marked this pull request as draft April 16, 2026 17:38
@assign-reviewers-posthog assign-reviewers-posthog Bot requested review from a team April 16, 2026 17:38
@tests-posthog
Contributor

tests-posthog Bot commented Apr 16, 2026

⏭️ Skipped snapshot commit because branch advanced to e645817 while workflow was testing a3c9787.

The new commit will trigger its own snapshot update workflow.

If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:

  • Merge master into your branch, or
  • Push an empty commit: git commit --allow-empty -m 'trigger CI' && git push

@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

This is a comment left during a code review.
Path: posthog/temporal/data_imports/sources/clickhouse/clickhouse.py
Line: 423-428

Comment:
**Wrong precision/scale for Decimal shorthand types**

For `Decimal32(S)`, `Decimal64(S)`, `Decimal128(S)`, and `Decimal256(S)` the single argument is the **scale**, not the precision — the precision is fixed by the variant (9 / 18 / 38 / 76). The current regex puts `S` into group 1 and interprets it as precision with an implied scale of 0, so `Decimal32(4)` produces `pa.decimal128(4, 0)` instead of the correct `pa.decimal128(9, 4)`. ClickHouse sends Arrow data with the real precision/scale, so the registered Delta schema and the actual wire schema disagree — downstream writes can fail or silently corrupt values.

The test `test_decimal_types` only asserts `isinstance(…, Decimal128Type)` so it doesn't catch the wrong precision/scale values.

Suggested fix — split the two forms:

```python
_DECIMAL_FIXED_WIDTHS: dict[str, int] = {"32": 9, "64": 18, "128": 38, "256": 76}
_DECIMAL_FIXED_RE = re.compile(r"^Decimal(32|64|128|256)\(\s*(\d+)\s*\)$")
_DECIMAL_VAR_RE = re.compile(r"^Decimal\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\)$")
```

Then in `_inner_to_arrow_type`:
```python
match_fixed = _DECIMAL_FIXED_RE.match(inner)
if match_fixed is not None:
    precision = _DECIMAL_FIXED_WIDTHS[match_fixed.group(1)]
    scale = int(match_fixed.group(2))
    return build_pyarrow_decimal_type(precision, scale)

match_dec = _DECIMAL_VAR_RE.match(inner)
if match_dec is not None:
    precision = int(match_dec.group(1))
    scale = int(match_dec.group(2)) if match_dec.group(2) is not None else 0
    return build_pyarrow_decimal_type(precision, scale)
```

And the test should be extended to assert both `field.type.precision` and `field.type.scale`.


---

This is a comment left during a code review.
Path: posthog/temporal/data_imports/sources/clickhouse/clickhouse.py
Line: 672-699

Comment:
**`incremental_field_type` parameter accepted but never used**

`_build_query` accepts `incremental_field_type` solely to validate it is not `None`, but the value is never used in the query string or the returned parameter dict (the returned `{}` is always discarded with `_` at the call site in `get_rows`). The parameter is superfluous — the guard on `incremental_field` alone is sufficient, and the type-specific logic (`incremental_type_to_initial_value`) already lives in `get_rows`.

```suggestion
def _build_query(
    *,
    database: str,
    table_name: str,
    should_use_incremental_field: bool,
    incremental_field: Optional[str],
) -> str:
    """Build the data extraction query.

    Returns the SQL string. We never interpolate the incremental cursor
    value directly — only identifiers (which are validated) end up in the
    SQL string.
    """
    qualified = _qualified_table(database, table_name)

    if not should_use_incremental_field:
        return f"SELECT * FROM {qualified}"

    if incremental_field is None:
        raise ValueError("incremental_field can't be None when should_use_incremental_field is True")

    quoted_field = _quote_identifier(incremental_field)
    return f"SELECT * FROM {qualified} WHERE {quoted_field} > %(last_value)s ORDER BY {quoted_field} ASC"
```

The call site in `get_rows` would change to `query = _build_query(...)` and you can drop `incremental_field_type=incremental_field_type` from that call.



@tests-posthog
Contributor

tests-posthog Bot commented Apr 16, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshot (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →


- query_arrow_stream yields RecordBatches; switch accumulator to
  pa.Table.from_batches to avoid the pa.concat_tables type mismatch.
- Build an explicit SELECT list and wrap Arrow-incompatible column types
  (UUID, IPv4/6, wide ints, Enum*, FixedString, Array, Map, Tuple,
  Nested, Variant, Dynamic, JSON, Object) in toString() to avoid
  ClickHouse error 50 on SELECT *.
- Extend row-count discovery to Distributed tables (SELECT count()
  fallback) and MaterializedViews (resolve TO target, else .inner_id
  inner table). Plain views and no-counter engines stay skipped.
- Upgrade discovery/query log lines to info so users see them on the
  syncs tab; add an entry log for get_rows().
- Frontend: show "Skipped" with an explanatory tooltip instead of
  "Unknown" when row count is unavailable.
- Add get_primary_keys_for_schemas that reuses _get_primary_keys per
  table and wire detected_primary_keys into SourceSchema so the
  frontend can suggest sorting-key columns during setup.
- Split DecimalN(S) from Decimal(P[, S]) — the former has fixed
  precision (9/18/38/76) and the lone arg is scale. Previous regex
  mis-mapped Decimal32(4) to Decimal(4, 0). Tests now assert exact
  precision and scale.
- Drop the unused incremental_field_type parameter from _build_query
  and return a plain SQL string instead of (str, dict). The type-aware
  cursor seeding already lives in get_rows.
@danielcarletti danielcarletti marked this pull request as ready for review April 16, 2026 20:32
@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

This is a comment left during a code review.
Path: products/data_warehouse/frontend/shared/components/forms/SchemaForm.tsx
Line: 109-113

Comment:
**ClickHouse-specific tooltip in a shared component**

The tooltip text references "Memory/Buffer/Log-engine tables, or Kafka/URL table functions" — these are ClickHouse engine names that make no sense to a Postgres, MySQL, or Snowflake user seeing a null row count. The `SchemaForm` component is shared across every source; any source that fails to return a row count (e.g. due to a permissions error, or simply because a given source never populates that field) will now surface ClickHouse-specific jargon to unrelated users.

```suggestion
                                    return (
                                        <Tooltip title="Row count is unavailable for this table. The table can still be synced — we just don't know its size up front.">
                                            <span className="text-muted-alt cursor-help">Skipped</span>
                                        </Tooltip>
                                    )
```



Cast the table name parameter to str before dict lookup to match the
typed dict's key.
Drop tests that only assert trivial string formatting or exact dict
values — keep the ones that exercise real logic (regex parsing, error
translation substring match, engine-specific row-count branches, etc.).
@tests-posthog
Contributor

tests-posthog Bot commented Apr 20, 2026

Query snapshots: Backend query snapshots updated

Changes: 1 snapshot (1 modified, 0 added, 0 deleted)

What this means:

  • Query snapshots have been automatically updated to match current output
  • These changes reflect modifications to database queries or schema

Next steps:

  • Review the query changes to ensure they're intentional
  • If unexpected, investigate what caused the query to change

Review snapshot changes →

Check for an empty or missing table name before indexing so callers get
the "Table name is missing" ValueError instead of an IndexError.
Resolve SchemaForm.tsx conflict — keep master's new primary-key column
and nested LemonCollapse structure while preserving the "Skipped"
tooltip that replaces the "Unknown" row-count label.
@danielcarletti danielcarletti merged commit 9edf2f9 into master Apr 20, 2026
235 checks passed
@danielcarletti danielcarletti deleted the dc-feature-clickhouse-source branch April 20, 2026 18:48
@deployment-status-posthog

deployment-status-posthog Bot commented Apr 20, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-04-20 19:24 UTC Run
prod-us ✅ Deployed 2026-04-21 04:41 UTC Run
prod-eu ✅ Deployed 2026-04-20 20:04 UTC Run
