Skip to content

Skip inaccessible databases in DBM schema collection instead of aborting#23807

Open
pierreln-dd wants to merge 10 commits into
masterfrom
pierreln-dd/SDBM-2634-schema-collector-per-db-error-isolation
Open

Skip inaccessible databases in DBM schema collection instead of aborting#23807
pierreln-dd wants to merge 10 commits into
masterfrom
pierreln-dd/SDBM-2634-schema-collector-per-db-error-isolation

Conversation

@pierreln-dd
Copy link
Copy Markdown
Contributor

@pierreln-dd pierreln-dd commented May 22, 2026

What does this PR do?

When a database cannot be opened during schema collection, the per-database error is now caught and the database is skipped with a warning instead of aborting collection for all databases. All other databases on the instance continue to be collected and flushed normally.

Changes across datadog_checks_base, sqlserver, and postgres:

  • Error isolation: _collect_database_schemas (extracted helper) wraps each database in a try/except guarded by _is_connection_error. SQL Server catches pyodbc.Error; Postgres catches psycopg.OperationalError. Internal errors (logic bugs, uninitialized state) propagate as before.
  • Observability: dd.<dbms>.schema.skipped_databases_count gauge emitted each run; status:partial tag on all schema metrics when at least one database was skipped.
  • Logging: skip warning includes the stack trace (exc_info=True).
  • Protocol correctness: terminal maybe_flush(is_last_payload=True) always fires so the backend receives collection_payloads_count regardless of skip count.
  • Readability: extracted _collect_database_schemas to reduce nesting depth in collect_schemas.

Motivation

When database_autodiscovery discovers a database that cannot be opened, _get_cursor() raises a driver-level error. The outer collect_schemas() previously caught and re-raised this, discarding all schemas collected from databases processed earlier in the same run. Accessible databases on the same instance reported no schema data.

This affects SQL Server and Postgres integrations that inherit from the shared SchemaCollector base class.

Related escalation: SDBM-2634.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add qa/required if this PR needs QA validation, or qa/skip-qa if it does not. Exactly one of the two is required.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

When a database is unreachable (e.g. version mismatch, restricted state),
_get_cursor raises and the outer collect_schemas re-raises, discarding all
schemas already collected for other databases. Wrap per-database iteration in
a try/except so that a single failing database is skipped with a warning and
collection continues for the rest. Move the final flush outside the loop to
ensure queued rows are submitted regardless of which database was last.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 22, 2026

Pipelines  Tests

Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

PR All | test / j563739d / Gitlab Runner   View in Datadog   GitHub Actions

🔄 Retry job. This looks flaky and may succeed on retry. Http status code 502 on url http://localhost:8085/ci

PR All | test / j7faf392 / CockroachDB   View in Datadog   GitHub Actions

🔄 Retry job. This looks flaky and may succeed on retry. Agent check failed for 'cockroachdb', context canceled due to unable to start CMD API server: unable to listen on localhost:32967: address already in use.

🧪 1 Test failed in 1 job

PR All | run   GitHub Actions

test_version_metadata from test_integration.py   View in Datadog (Fix with Cursor)

Traceback (most recent call last):
  File &#34;/home/runner/work/integrations-core/integrations-core/datadog_checks_base/datadog_checks/base/checks/base.py&#34;, line 1423, in run
    self.check(instance)
    ~~~~~~~~~~^^^^^^^^^^
  File &#34;/home/runner/work/integrations-core/integrations-core/datadog_checks_base/datadog_checks/base/utils/tracing.py&#34;, line 55, in wrapper
    return f(self, *args, **kwargs)
  File &#34;/home/runner/work/integrations-core/integrations-core/datadog_checks_base/datadog_checks/base/utils/tracing.py&#34;, line 55, in wrapper
    return f(self, *args, **kwargs)
  File &#34;/home/runner/work/integrations-core/integrations-core/gitlab_runner/datadog_checks/gitlab_runner/gitlab_runner.py&#34;, line 65, in check
...

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 90.62%
Overall Coverage: 87.47%

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 2338fe4 | Docs | Datadog PR Page | Give us feedback!

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 90.62500% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.90%. Comparing base (411c31d) to head (2338fe4).
⚠️ Report is 6 commits behind head on master.

Additional details and impacted files
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

pierreln-dd and others added 2 commits May 22, 2026 13:08
- Add _is_connection_error hook so subclasses can restrict which errors
  are treated as recoverable (defaults to True to preserve behavior)
- Remove redundant continue in except block
- Add exc_info=True to warning so stack trace is preserved in logs
- Track _skipped_databases_count and emit it as a gauge metric so
  partial collection failures are observable
- Add unit tests for error isolation and the _is_connection_error hook

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…add sqlserver override

- Guard final maybe_flush so no empty payload is emitted when all databases
  are skipped or no data was collected (fixes new behavioral regression)
- Add SQLServerSchemaCollector._is_connection_error that only catches pyodbc
  errors, preventing internal errors like uninitialized pre-2017 cursor from
  being silently swallowed and misclassified as per-database skips
- Add test asserting no payload is emitted when all databases fail

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
pierreln-dd and others added 2 commits May 22, 2026 14:12
collect_schemas was reaching 6 levels of nesting. Extracting per-database
logic into _collect_database_schemas brings the max depth to 4 and makes
the high-level flow readable without scrolling past error handling.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…erminal flush

- Add PostgresSchemaCollector._is_connection_error to narrow recovery to
  psycopg.Error only, matching the SQL Server approach
- Remove the empty-payload guard: always emit the terminal flush so the
  backend snapshotting protocol receives collection_payloads_count regardless
  of how many databases were collected
- Emit status:partial on schema metrics when one or more databases were
  skipped, distinguishing partial collection from full success or hard error
- Update tests to assert status:partial tag and terminal payload presence

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
pierreln-dd and others added 3 commits May 22, 2026 14:45
… customer-facing

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- Rewrite sqlserver changelog to customer-facing language without implementation examples
- Remove example details from _is_connection_error docstrings
- Narrow PostgresSchemaCollector._is_connection_error to psycopg.OperationalError
  so only connection-level failures skip a database (ProgrammingError and other
  non-access errors propagate as hard failures)
- Add unit tests in sqlserver and postgres verifying the override catches driver
  errors but not internal exceptions

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@pierreln-dd pierreln-dd marked this pull request as ready for review May 22, 2026 14:01
@pierreln-dd pierreln-dd requested review from a team as code owners May 22, 2026 14:01
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2231e1168f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +124 to +128
while next_row:
self._queued_rows.append(self._map_row(database, next_row))
self._total_rows_count += 1
next_row = self._get_next(cursor)
self.maybe_flush(is_last_payload=False)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid emitting rows for a database later skipped

When a recoverable driver error happens after this loop has already appended rows for the current database (or after maybe_flush has sent a chunk), the except below counts the database as skipped but leaves those partial rows in the snapshot. This can happen for SQL Server legacy collection because _map_row runs additional per-table queries, or for any cursor that fails during fetch, so an inaccessible/flaky database may still replace backend metadata with an incomplete schema. Buffer per-database rows and only merge/flush them after the database completes, or only apply the skip path to cursor-open failures.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Addressed in the latest commit: unflushed rows accumulated for the current database are now discarded from _queued_rows in the except block before the skip is recorded. For the common case (database fails before or early in collection, no chunk flush yet) the snapshot stays clean. For large databases where a mid-loop maybe_flush already sent a chunk, those rows can't be recalled — a separate warning is emitted to make the incomplete snapshot visible in logs.

If an error fires after some rows have already been appended to the queue
(e.g. _map_row failure on the nth table), roll back the unflushed rows for
that database so they don't pollute the snapshot. Rows already sent via a
mid-loop chunk flush cannot be recalled; log a warning in that case so the
incomplete snapshot is visible.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 22, 2026

Validation Report

All 21 validations passed.

Show details
Validation Description Status
agent-reqs Verify check versions match the Agent requirements file
ci Validate CI configuration and Codecov settings
codeowners Validate every integration has a CODEOWNERS entry
config Validate default configuration files against spec.yaml
dep Verify dependency pins are consistent and Agent-compatible
http Validate integrations use the HTTP wrapper correctly
imports Validate check imports do not use deprecated modules
integration-style Validate check code style conventions
jmx-metrics Validate JMX metrics definition files and config
labeler Validate PR labeler config matches integration directories
legacy-signature Validate no integration uses the legacy Agent check signature
license-headers Validate Python files have proper license headers
licenses Validate third-party license attribution list
metadata Validate metadata.csv metric definitions
models Validate configuration data models match spec.yaml
openmetrics Validate OpenMetrics integrations disable the metric limit
package Validate Python package metadata and naming
qa-label Validate the pull request declares whether it needs QA for the next Agent release
readmes Validate README files have required sections
saved-views Validate saved view JSON file structure and fields
version Validate version consistency between package and changelog

View full run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant