Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: bson timestamp type casting #2593

Merged
merged 2 commits into from
Feb 3, 2024
Merged

fix: bson timestamp type casting #2593

merged 2 commits into from
Feb 3, 2024

Conversation

tychoish
Copy link
Collaborator

@tychoish tychoish commented Feb 2, 2024

Closes #2590
Closes #2585

statement ok
copy timestamps to '${TMP}/timestamp_test_out.bson';

# bson's date format is an int64, of milliseconds since unix epoch;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there the '\x09'

"UTC datetime - The int64 is UTC milliseconds since the Unix epoch."

which isn't part of the internal mongodb spec

https://bsonspec.org/spec.html

Copy link
Collaborator Author

@tychoish tychoish Feb 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, exactly, that's what we're using. We just convert all arrow Date and Timestamp values into these int64 (millis since epoch), but when we read bson data back into GlareDB all of those values become Date64; not back to timestamps.

> create table timestamps (t timestamp);
Table created
> insert into timestamps values (arrow_cast(946684860000000, 'Timestamp(Microsecond, None)'));
Inserted 1 row
>
>
> select * from timestamps;
┌─────────────────────┐
│ t                   │
│ ──                  │
│ Timestamp<µs, UTC>  │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘
> copy timestamps to 'timestamps.bson';
Copy success
> select * from timestamps;
┌─────────────────────┐
│ t                   │
│ ──                  │
│ Timestamp<µs, UTC>  │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘
> select * from 'timestamps.bson';
┌─────────────────────┐
│ t                   │
│ ──                  │
│ Date64              │
╞═════════════════════╡
│ 2000-01-01T00:01:00 │
└─────────────────────┘

It works as intended but sqllogictest renders Date64 as int64s.

When we add support for declaring a schema rather than inferring it, we'll be able to cast these bson dates to whatever we need (with only the expected loss of fidelity and resolution), and the code mostly supports this today.


which isn't part of the internal mongodb spec

The internal one is \x11 which they call Timestamp (it's an int32 of seconds-since-epoch, and a int32 increment, but it's used as a logical/monotonic clock, but a sort of weird one to keep operations in the replication log (oplog) in order.) BSON libraries implement but no one should use it.

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Id like to keep #2590 open as we still technically can't roundtrip those types. Since we aren't notifying the user, or have it documented anywhere that you can't roundtrip these types, I think the issue is still valid.

@tychoish
Copy link
Collaborator Author

tychoish commented Feb 3, 2024

Id like to keep #2590 open as we still technically can't roundtrip those types. Since we aren't notifying the user, or have it documented anywhere that you can't roundtrip these types, I think the issue is still valid.

I understand the desire for precision, but the same issue impacts JSON and CSV export (only in even more cases, as they have fewer types,) and more don't round trip. Parquet, even, has a less sophisticated type system than arrow (and therefore glaredb) and would probably lose some type fidelity in the round trip. We can keep a general issue or epic open for "round trip types" as a placeholder for these kinds of issues, but the solutions are all a bit imperfect, so I doubt we'd ever be able to truly resolve it. Having a placeholder might be useful, or it might just confuse things even more.

There's also nothing actionable in the glaredb code base to do for this, about BSON right now: we can't create new types in the upstream format. We could document this, but we should track that in a documentation issue. This would be a good page to have at some point.

We could consider implementing or providing a "strict" mode (or more likely various strict-modes), with regards to type conversions where we could propagate errors for any non-roundtrip-able types, but that's a different issue, and would impact most data sources. I think that's a different issue.

Being able to declare the schema rather than inferring it from the external type-system, provides us the opportunity to up-cast values, and perhaps change some of the down-casting semantics to support something that more closely approximates round tripping. This is something that we'd like to do and can work on, in the near term, and I think has the greatest effort-to-impact ratio, but it's also a different issue.

@tychoish tychoish merged commit f8be59e into main Feb 3, 2024
22 checks passed
@tychoish tychoish deleted the tycho/bson-timestamp-fixes branch February 3, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bson roundtrip doesn't work with timestamp fields panic when copying to bson
2 participants