Add a history of storage format versions et. al. #5205
I linked each change to the pull request that introduced it.
## Version 22

Introduced in TileDB 2.25
I did not add the commit or PR that bumped the version, because several of the changes did not happen at the same time the version was bumped, and the purpose of this spec is that you don't need to read the source code to understand it.
format_spec/history.md
Outdated
Introduced in TileDB 2.25

* The _Current domain_ field was added to [array schemas](./array_schema.md).
format_spec/history.md
Outdated
Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for fixed-size strings.
Introduced in TileDB 2.17

* Arrays can have [enumerations](./enumeration.md).
Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.
General feedback: should we be including links to PRs in the Markdown itself? If you don't think that's appropriate, we can skip it, but in past open-source projects I've worked on, it was common practice to publicly include the PRs in all release documentation.
It's definitely not required as I said above, and personally I would prefer to clearly separate the storage format specification from its implementation, but I'm not strongly opposed to it.
@ihnorton what do you think?
* Deleting data from arrays is supported. Arrays can have [delete commit files](./delete_commit_file.md).
* Updating data in arrays is supported. Arrays can have [update commit files](./update_commit_file.md).
* Fragment metadata contain [tile processed conditions](./fragment.md#tile-processed-conditions).
format_spec/history.md
Outdated
Introduced in TileDB 2.12

* Deleting data from arrays is supported. Arrays can have [delete commit files](./delete_commit_file.md).
Introduced in TileDB 2.14

* The _Order_ field was added to [attributes](./array_schema.md#attribute).
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile.
I am not entirely sure: from the storage format perspective, does this change only mean that the offsets file will be empty?
I think so
* The [array file hierarchy](./array_file_hierarchy.md) was updated to store fragments, commits and consolidated fragment metadata in separate subdirectories.
* The extension of commit files was changed to `.wrt`.
* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.
The same with the dictionary filter happened in version 13, but I am not including it because that's when the dictionary filter was added.
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Committed fragments are indicated by the presence of an `.ok` file in the array's directory, with the same [timestamped name](./timestamped_name.md) as the fragment.
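
The `.ok` convention in the last bullet can be sketched in a few lines. This is a minimal illustration rather than TileDB's implementation: the helper name `is_fragment_committed` and the fragment names are hypothetical, and the sketch assumes a local filesystem where the fragment directory and its `.ok` file sit side by side in the array's directory.

```python
import tempfile
from pathlib import Path


def is_fragment_committed(array_dir: Path, fragment_name: str) -> bool:
    """Return True if `<fragment_name>.ok` exists in the array directory.

    Hypothetical helper: it only mirrors the convention described in the
    bullet (an `.ok` file sharing the fragment's timestamped name).
    """
    return (array_dir / f"{fragment_name}.ok").is_file()


# Build a throwaway array directory with one committed and one
# uncommitted fragment. The names below are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    array_dir = Path(tmp)
    (array_dir / "__1_1_aaaa").mkdir()
    (array_dir / "__1_1_aaaa.ok").touch()
    (array_dir / "__2_2_bbbb").mkdir()  # no .ok file: not committed

    committed_a = is_fragment_committed(array_dir, "__1_1_aaaa")
    committed_b = is_fragment_committed(array_dir, "__2_2_bbbb")
```

A reader following this convention would skip `__2_2_bbbb` until its `.ok` file appears.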
* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
Introduced in TileDB 2.2.3

* [Data files](./fragment.md#data-file) are named by the name of their attribute or dimension, after percent encoding certain characters. These characters are `!#$%&'()*+,/:;=?@[]`, as specified in [RFC 3986](https://tools.ietf.org/html/rfc3986), as well as `"<>\|`, which are not allowed in Windows file names.
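
The naming rule above can be illustrated with a short sketch. It is not taken from the TileDB source: the function name `encode_field_name` is hypothetical, and the entry does not say whether the hex digits are upper or lower case, so upper case here is an assumption. Only the two character sets come from the bullet itself.

```python
# Characters listed in the changelog entry.
RFC3986_CHARS = "!#$%&'()*+,/:;=?@[]"
WINDOWS_FORBIDDEN = '"<>\\|'
ESCAPED = set(RFC3986_CHARS + WINDOWS_FORBIDDEN)


def encode_field_name(name: str) -> str:
    """Percent-encode only the characters listed above.

    Hypothetical helper; every listed character is ASCII, so a single
    two-digit hex escape per character is enough.
    """
    return "".join(
        f"%{ord(ch):02X}" if ch in ESCAPED else ch
        for ch in name
    )


# Example attribute names (made up for illustration).
encoded_slash = encode_field_name("a/b")
encoded_angle = encode_field_name("temp<C>")
encoded_plain = encode_field_name("pressure")
```

Names without any listed character, such as `pressure`, pass through unchanged, while `a/b` becomes `a%2Fb`.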
## Version 8

Introduced in TileDB 2.2.3
Introduced in 2.3.x, backported to 2.2.x in #2054.
Introduced in TileDB 2.0

* Dimensions are stored in separate [data files](./fragment.md#data-file).
I would try to mention zipped coordinates here.
My plan for documenting zipped coordinates is to update `fragment.md` in a subsequent PR to say something like "prior to format version 6 dimension data for sparse arrays are stored in a `__coords` file…".
What do you think?
Sounds good.
* Attributes can be nullable.
* The _Nullable_ and _Fill value validity_ fields were added to [attributes](./array_schema.md#attribute).
* The _Validity filters_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Fragment metadata contain validity [tile offsets](./fragment.md#tile-offsets).
Introduced in TileDB 2.1

* The _Fill value_ field was added to [attributes](./array_schema.md#attribute).
* Sparse arrays can have string dimensions and dimensions with different datatypes.
* The _Dimension datatype_, _Cell val num_ and _Filters_ fields were added to [dimensions](./array_schema.md#dimension).
* The _Domain size_ field was added to [dimensions](./array_schema.md#dimension). The domain of a dimension can have a variable size.
* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The structure of [fragment metadata files](./fragment.md#fragment-metadata-file) was overhauled.
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
* The _Bounding coords_ field was removed.
Introduced in TileDB 1.5

* Cell coordinate values of each dimension are always stored next to each other, regardless of whether they are filtered with a compression filter or not.
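
To make the layout change concrete, here is a small sketch of what "stored next to each other" means, under the assumption that coordinates start out interleaved cell by cell (for two dimensions: `x1, y1, x2, y2, ...`). The function name `split_coordinates` is hypothetical and the code illustrates the layout only; it is not how TileDB implements it.

```python
def split_coordinates(zipped: list, dim_num: int) -> list:
    """Regroup interleaved coordinates so each dimension is contiguous.

    `zipped` holds one coordinate tuple per cell, flattened:
    [x1, y1, x2, y2, ...]. The result holds all values of the first
    dimension, then all values of the second, and so on.
    """
    if dim_num <= 0 or len(zipped) % dim_num != 0:
        raise ValueError("length must be a multiple of dim_num")
    return [
        zipped[i]
        for dim in range(dim_num)
        for i in range(dim, len(zipped), dim_num)
    ]


# Three 2D cells: (1, 10), (2, 20), (3, 30).
split = split_coordinates([1, 10, 2, 20, 3, 30], dim_num=2)
```

Here `split` is `[1, 2, 3, 10, 20, 30]`: the values of each dimension end up adjacent.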
The first part of the work is done. For ease of review we can do the second part in a subsequent PR.
Overall, looks good to me! Only one comment about including the issues directly in line with the release notes.
Introduced in TileDB 2.9

* The [dictionary filter](./filters/dictionary_encoding.md) was added.
Consider mentioning:
Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.
format_spec/history.md
Outdated
Introduced in TileDB 2.10

* Fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
Suggested change:
- * Fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
+ * Consolidated fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
Done.
format_spec/history.md
Outdated
Introduced in TileDB 2.11

* Fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
Suggested change:
- * Fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
+ * Consolidated fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
That wasn't apparent from a read of the source code but sure, it's valuable to mention. Reminder for myself to update the documentation of this flag in a subsequent PR.
format_spec/history.md
Outdated
* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.

> [!NOTE]
> This version does not contain any changes to the storage format, but was introduced as an indicator for implementations to not rely on tile metadata under certain circumstances on previous versions.
This only applies to fixed-size strings on dense arrays, I think?
I wrote this in the bullet point above and did not want to repeat myself.
I'm mostly worried someone reads this "not rely on tile metadata under certain circumstances on previous versions." and makes the wrong assumption. I think we should repeat it for 100% clarity. This is the last issue for me in this PR; once we resolve it, it's good to go IMO.
Updated.
Co-authored-by: Theodore Tsirpanis <theodore.tsirpanis@tiledb.com>
[SC-54621](https://app.shortcut.com/tiledb-inc/story/54621/explicitly-mention-what-changed-in-each-storage-format-version-throughout-the-specification)

Continuing #5205, this PR goes through each format version and mentions the changes to the main specification document. It also documents structures like `__coords.tdb` and legacy fragment metadata, and fixes any defects encountered in the meantime.

---
TYPE: FORMAT
DESC: The storage format specification was updated to document format changes of previous versions throughout the main document.

---------

Co-authored-by: Nick Vigilante <nickvigilante@users.noreply.github.com>
SC-50706
This PR updates the storage format specification to include information about all past versions. The work is divided into two parts. First, a new file was created that lists what changed in each storage format version, going into more detail than the existing table (which will be removed). After that, the fields in the various data structures will be updated to indicate in which version they were introduced.
I sourced the changes in each storage format by searching for the `version(\(\)|_)? (>=?|<=?|==) \d+` regex in the code, as well as searching for references to these named constants. I also made other small fixes to the spec as I found them.
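
For readers who want to see what that regex picks up, here is a small sketch that applies it line by line. The regex is the one quoted in the description; the helper `find_version_checks` and the sample lines are made up for illustration and are not taken from the TileDB code base.

```python
import re

# Regex quoted in the PR description.
VERSION_CHECK = re.compile(r"version(\(\)|_)? (>=?|<=?|==) \d+")


def find_version_checks(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched text) for every version comparison."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in VERSION_CHECK.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Hypothetical C++-like lines, invented for this example.
SAMPLE = """\
if (version_ >= 5) {
if (format_version() < 10) {
if (version == 7) {
int version = 3;
"""

hits = find_version_checks(SAMPLE)
```

The first three lines match (`version_ >= 5`, `version() < 10`, `version == 7`); the plain assignment on the last line does not, because a single `=` is not one of the comparison operators in the pattern.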
Information about past format versions of groups will be added in another PR.
TYPE: NO_HISTORY
DESC: The storage format specification was updated to include information about all previous versions.