fix(dbt): specify utf-8 encoding when opening files#4503
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses Windows-specific Unicode errors in the dbt integration by making file encoding explicit (UTF-8) when reading/writing dbt artifacts and structured log files.
Changes:
- Specify
encoding="utf-8"for dbt log file creation/opening in the structured logs processor. - Specify
encoding="utf-8"when opening dbt manifest/metadata JSON and YAML files in the local artifact processor. - Add a new root-level markdown file (
issue_encoding.md) containing the issue description.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| issue_encoding.md | Adds a markdown note describing the encoding issue and proposed fix. |
| integration/common/src/openlineage/common/provider/dbt/structured_logs.py | Opens/creates the dbt log file using UTF-8 to avoid platform-default encoding issues. |
| integration/common/src/openlineage/common/provider/dbt/local.py | Reads JSON/YAML artifacts using UTF-8 to avoid platform-default encoding issues. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ### Description | ||
| In `integration/common/src/openlineage/common/provider/dbt/structured_logs.py` and `integration/common/src/openlineage/common/provider/dbt/local.py`, the `open()` function is used without specifying an `encoding` argument. | ||
|
|
||
| This relies on the platform's default encoding (which is `cp1252` on Windows) and can cause `UnicodeDecodeError` or `UnicodeEncodeError` when reading/writing logs or manifest files that contain non-ASCII characters. | ||
|
|
||
| ### Proposed fix | ||
| Add `encoding="utf-8"` to all `open()` calls in the `dbt` provider module. | ||
|
|
||
| ### Location of missing encodings | ||
| - `integration/common/src/openlineage/common/provider/dbt/local.py` (lines 201, 236) | ||
| - `integration/common/src/openlineage/common/provider/dbt/structured_logs.py` (lines 918, 921) |
There was a problem hiding this comment.
This file looks like a copied issue description and doesn’t appear to be used by the codebase or referenced from docs. If it was only added for PR context, consider removing it (or moving it into the appropriate documentation location) to avoid adding an untracked root-level artifact to releases/packages.
Signed-off-by: Harish Chandra Thakur <hcthakur2004@email.com>
6c71fa5 to
6a0a449
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4503 +/- ##
=======================================
Coverage 72.83% 72.83%
=======================================
Files 21 21
Lines 2271 2271
=======================================
Hits 1654 1654
Misses 617 617 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thanks @hcthakur2004 |
…ns facet move - Update _get_schema() in generated Python file to reference schema version 1-2-0 - Move dataQualityAssertions from inputFacets to facets in dbt test snapshots - Update _schemaURL in snapshots from 1-1-0 to 1-2-0 - Rebase on upstream/main to incorporate upstream dbt fixes (OpenLineage#4499-OpenLineage#4503) Signed-off-by: Rohit Sharma <rohitcse.gec@gmail.com>
Fixes #4502