add new index format, to overcome fundamental 32-bit (< 2 GB) limitation for current index files #550
Comments
Index files increase performance significantly for files that are read sparsely, with many unused GRIB2 records in between.
IMO we should just document this 2 GB limit in GRIB2 files and move on, even though it is a major limitation. I had to use wgrib2 to get around it, and that only worked because I subsetted the GRIB2 file into another, much smaller one. This was needed for a benchmark verification, and I did not want to include a wgrib2 build in the benchmark requirements (ours are already too complex), so I gave up there and used a Python verification package supplied by EMC which did not use a GRIB2 intermediate.
--
George W Vandenberghe
Lynker Technologies at NOAA/NWS/NCEP/EMC
5830 University Research Ct., Rm. 2141
College Park, MD 20740
301-683-3769 (work), 301-775-1547 (cell)
The GRIB index may be used by other software, e.g. GrADS, to process the g2 files.
Since the WMO has no definition for this format, it would be better if we could skip a given index record but keep using the index file.
OK, @GeorgeVandenberghe-NOAA good point about performance. So we want to keep index files. How about we create a version 2 index file, and put some bytes at the beginning of it to indicate it's a version 2 index file. Then we can use 64-bit offsets within it, where needed. We could then change NCEPLIBS-g2 to automatically recognize whether it is reading a new index file or not. We would add new functions as needed to write the new index files, and continue to write the existing index files with existing functions. This will be fully backward compatible, but give users an easy upgrade path to use indexes with 64-bit files.
This sounds like a really good idea.
OK, the release I just published today has a skgb8() function that can handle files > 2 GB. That's all that degrib2 needs. I will take a look at this index stuff, but it won't be until the next release. This would be done in the C library, and then wrappers in Fortran would provide it to Fortran programmers.
I'll just add a +1 for this issue being addressed. The SRW App would like to be able to initialize off RRFS grib2 files, which are all > 10 GB (Issue #660). Chgres_cube uses getgb2 to parse and read in data, and that requires creating an index file.
Hi, just wondering if there are any updates on this issue. We still need this capability in order to initialize model runs using RRFS grib2 data. Thanks
I will be working on this starting in December and hope to have it out early in 2024...
Sounds good, thanks for the update.
To those who are interested in this feature, can you please provide some sample files so I can test on them? Put the sample files on Hera somewhere, preferably in a subdirectory on scratch with nothing else in it. Then those files can be used for GRIB library testing, and we will then ensure that the libraries never break for those files, which is helpful.
@edwardhartnett Sure, I can put some sample files on Hera today. I will send you the path. Thanks
@edwardhartnett I placed 4 sample files in this directory on Hera: /scratch2/NCEPDEV/fv3-cam/Benjamin.Blake/files_for_ed They are grib2 files from the RRFS real-time parallel which are ~7-7.5 GB, and they are on a 3-km North American rotated lat-lon grid.
Thanks! I will build some tests around those and make sure they work OK...
A file is on Hera: /scratch1/NCEPDEV/global/gwv/BIG.G2/GFSPRS.GrbF06
This is a pressure GRIB2 file from a C1152 run of ufs-weather-model on Dogwood.
OK, thanks for the test files! I am continuing to make progress here and will use your test files this week for testing the new changes...
OK, I have changed everything to support the new index format. Everything is done in a fully backward-compatible way. However, the subroutine getgb2() will generate a version 2 index if it is not passed an index. (If it is passed an existing index file, it will transparently handle version 1 or 2.) So existing index files for GRIB2 files < 2 GB will continue to work and don't have to be regenerated. (I will change NCEPLIBS-grib_util so that it generates version 2 indexes.)
Can those who provided test files take a look at the degrib2 output for your file? In particular, look to ensure that the correct number of messages is shown, and that the last message, or the last few, look reasonable.
@edwardhartnett The RRFS prslev file looks good.
I don't have any way to verify whether my GFS pressure-level GRIB file index is good, short term. Downstream, though, I want to use the library to run cnvgrib -g21 on this file and then run a verification script off of that which does not use wgrib2.
@GeorgeVandenberghe-NOAA why do you want to convert this to GRIB1? I have not checked, but I highly doubt any of the GRIB1 code can handle files > 2 GB. Judging from how much work it has been to get the g2 library to handle > 2 GB, that sounds like a lot of work. Add in the fact that there are no tests for w3emc, and I shudder in horror at the prospect. The conversion of g2 was made immeasurably easier by the unit tests. I can't imagine trying to do this code without them. And it seems likely that systems set up for GRIB1 will have many problems with files > 2 GB. A 4-byte int is no longer sufficient for file offsets, and that has far-reaching implications.
In C, this was a whole lot easier. Fortran is not made for bit-fiddling or templating, nor does it have much flexibility with type conversions. C allows direct bitwise operations, and void pointers and casting allow templating and arbitrary type conversion quite easily.
OK, I have this working well, but I am pondering how to implement this in the user interface(s). Right now the code can produce either v1 or v2 indexes. It can transparently read either v1 or v2 indexes. v1 indexes will not work for files > 2 GB; v2 indexes work for all files. (But the library does not detect any problems when a v1 index is generated on a file > 2 GB. It just generates the index for all the messages that fit in the first 2 GB.) I could:
One question that will impact this: do any users crack open the index files in their own code? That is, do users have code which parses the index file? Or do users just leave the index file contents to the g2 library and never look at them? Any input welcome. I need to decide very soon. I guess I'm leaning towards option 2, as it is the safest...
Probably laziness. I have several GRIB1 converters and readers that have just worked since 1996 (originally developed on a Cray C90), and I have not made the effort to replace them with GRIB2 analogs. To feed them I convert GRIB2 files to GRIB1. But yeah, at some point this lengthening chain of converters and transformers upstream should be discarded. And it's absolutely not worth the effort to get GRIB1 to support > 2 GB record locations, agreed.
OK, an easier workaround would be to break files > 2 GB into files < 2 GB. Since no message can exceed 2 GB, this is always possible. Getting GRIB1 to work with > 2 GB is, we both agree, pointless. We will focus our efforts on GRIB2.
This is working well. I will close this issue.
Summary
The index files produce a significant performance benefit in some cases when very large files are involved. Now that we are talking about > 10 GB files, no doubt this will be even more the case.
The current index file format contains a 32-bit offset, which can't handle files > 2 GB. We will add a new index format which will work in a backward-compatible way and support 64-bit file offsets.
Original Discussion
We have the concept of index files. The index file contains byte offsets into an existing GRIB2 file.
It should be noted that indexes are not likely to yield much benefit to g2 code. The g2 code does not just scan the file randomly, it finds a GRIB2 message, and reads the byte offset info it needs to correctly jump past giant sections of data to the next message. The only time the index can help is for files that have large amounts of non-GRIB data in the file, between messages.
So the index can't help much. And in testing, reading and using the index was actually slightly slower than just reading the file directly.
That said, we have indexes and have to continue to support them, while discouraging new use.
Another problem with indexes is that they contain 32-bit values:
!> The index buffer returned contains index records with the
!> format:
!> - byte 001 - 004 length of index record
!> - byte 005 - 008 bytes to skip in data file before GRIB message
!> - byte 009 - 012 bytes to skip in message before lus (local use) set = 0, if no local section.
!> - byte 013 - 016 bytes to skip in message before gds
!> - byte 017 - 020 bytes to skip in message before pds
!> - byte 021 - 024 bytes to skip in message before drs
!> - byte 025 - 028 bytes to skip in message before bms
!> - byte 029 - 032 bytes to skip in message before data section
!> - byte 033 - 040 bytes total in the message
!> - byte 041 - 041 GRIB version number (2)
!> - byte 042 - 042 message discipline
!> - byte 043 - 044 field number within GRIB2 message
!> - byte 045 - ii identification section (ids)
!> - byte ii + 1- jj grid definition section (gds)
!> - byte jj + 1- kk product definition section (pds)
!> - byte kk + 1- ll the data representation section (drs)
!> - byte ll + 1-ll + 6 first 6 bytes of the bit map section (bms)
The problem is the second field, bytes to skip in file before message. When a file is > 2 GB, there is no way to express this number in 4 bytes for the messages beyond the 32-bit address boundary.
So what to do?
@Hang-Lei-NOAA @GeorgeGayno-NOAA @GeorgeVandenberghe-NOAA @AlexanderRichert-NOAA @aerorahul suggestions welcome.