Rename embeddings file to include MGRS code and store GeoTIFF source_url #86
Conversation
Passing the URL or path of the GeoTIFF file through the datapipe, and into the model's prediction loop. The geopandas.GeoDataFrame now has an extra 'source_url' string column, and this is saved to the GeoParquet file too.
For each MGRS code (e.g. 12ABC), save a GeoParquet file with a name formatted like `{MGRS:5}_v{VERSION:2}.gpq`, e.g. 12ABC_v01.gpq. Have updated the unit test to check that rows with different MGRS codes are saved to different files.
Using ZStandard compression instead of Parquet's default Snappy compression. Should result in slightly smaller filesizes, and slightly faster data transfer and compression (especially over the network). Also changed an assert statement to an if-then-raise instead.
Speed up embedding generation by enabling multiple workers to fetch and load mini-batches of GeoTIFF files independently, and run the prediction. The predictions (generated embeddings) from each worker, each a geopandas.GeoDataFrame, are then concatenated together row-wise before getting passed to the GeoParquet output script. This is done via LightningModule's `on_predict_epoch_end` hook. Also documented these new processing steps in the docstring.
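For reference, a minimal sketch of what that gather step could look like (the class name and the stub inside `predict_step` are hypothetical; only the `on_predict_epoch_end` hook is named in the actual change):

```python
import geopandas as gpd
import pandas as pd
from lightning.pytorch import LightningModule


class EmbeddingPredictor(LightningModule):  # hypothetical class name
    def __init__(self):
        super().__init__()
        self.gdfs: list = []  # per-worker mini-batch results

    def predict_step(self, batch, batch_idx):
        # Stub: the real code runs the encoder and builds a GeoDataFrame of embeddings
        gdf = gpd.GeoDataFrame(
            data={"source_url": batch["source_url"]},
            geometry=gpd.points_from_xy(batch["x"], batch["y"]),
            crs="EPSG:4326",
        )
        self.gdfs.append(gdf)
        return gdf

    def on_predict_epoch_end(self):
        # Concatenate each worker's GeoDataFrame row-wise before the GeoParquet output step
        return gpd.GeoDataFrame(pd.concat(objs=self.gdfs, ignore_index=True))
```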
```python
# Output to a GeoParquet filename like {MGRS:5}_v{VERSION:2}.gpq
outpath = f"{outfolder}/{mgrs_code}_v01.gpq"
```
@mattpaul, are you ok with a filename like `12ABC_v01.gpq`? We could also add a prefix like `embedding_` or `MGRS_` if you prefer.
This convention looks good. Will all the chips for a given MGRS tile be contained in a single GeoParquet file? Assuming something like date or build number goes into `outfolder`?
Cool, I'll stick with that filename convention then!
> Will all the chips for a given MGRS tile be contained in a single GeoParquet file? Assuming something like date or build number goes into `outfolder`?
Yes, each file (e.g. `32VLM_v01.gpq`) would contain all the embeddings for every chip within that MGRS tile. It will also contain multiple dates, so you could have overlapping chips in any one spatial area, due to images taken at different dates. The easiest way might be to show you what this looks like in QGIS:

I've put 75% transparency on each green chip, and you'll notice that some chip areas are lighter in colour (so only 1 embedding from 1 date), whereas others are darker in colour from multiple overlaps (2 embeddings from 2 dates).
Not sure what you mean by build number. Do you mean the `_v01` part?
OK, cool, I see... the animated gif helps, thanks!
@weiji14 from our previous discussion around having `{DATE}` in the file name, I thought that embeddings from the same tile/chip geometry taken at different dates would be serialized into separate vector embedding files, which might be preferable considering we may want to incrementally add to the set of vector embeddings over time.
@leothomas thoughts re: pros vs cons of combining multiple embeddings from different dates/times into a single GeoParquet export file per tile across all time? I don't know if we plan to ever want to train the model on all 7 years of available Sentinel-2 data, but as a thought experiment for the sake of argument, that would yield a rather large file that increases in size over time. Contrast that with an incremental approach in which the number of GeoParquet file objects in S3 would grow over time as the model processes more and more data, but the individual file size is expected to remain static, which makes it easier to scale ingestion horizontally.
re: build number - I am thinking about the model training lifecycle. If we zoom out, so to speak, and consider Clay's roadmap over time w.r.t. model training, considering the availability of new Sentinel data products, etc., we will likely want to re-train the model periodically.

Perhaps once a month or once a quarter we may want to refresh the training dataset with the latest data available from Sentinel-1/2, or perhaps train on new sensor data as well in addition to the Sentinel data products.

We will want to be able to distinguish between old builds or old model versions vs new versions, because parts of our distributed system will need to continue working with old versions of data, embeddings, and map image tile assets while the next version is being processed.

Hence we'll want something like a build number or model version to uniquely identify each along the way, similar to how OpenAI has tagged versions of GPT, etc.
I assumed that `_v01` was meant to track our embedding file format (which has already changed quite a few times, if I'm not mistaken), or was that intended to track the model version?
> @weiji14 from our previous discussion around having `{DATE}` in the file name, I thought that embeddings from the same tile/chip geometry taken at different dates would be serialized into separate vector embedding files, which might be preferable considering we may want to incrementally add to the set of vector embeddings over time.
>
> @leothomas thoughts re: pros vs cons of combining multiple embeddings from different dates/times into a single GeoParquet export file per tile across all time? I don't know if we plan to ever want to train the model on all 7 years of available Sentinel-2 data, but as a thought experiment for the sake of argument, that would yield a rather large file that increases in size over time. Contrast that with an incremental approach in which the number of GeoParquet file objects in S3 would grow over time as the model processes more and more data, but the individual file size is expected to remain static, which makes it easier to scale ingestion horizontally.
Just to do some quick math: one embedding file for an MGRS tile (roughly 3 dates, ~1000 rows in total) is about 3.0MB. Let's say there are 73 dates in a year (365/5-day revisit) and 7 years (2017-2023), so a single file per MGRS tile could grow to something like 1.5GB over the full archive.
What about if we named the file like `{MGRS}_{MINDATE}_{MAXDATE}_{VERSION}.gpq`, e.g. `32VLM_20170517_20221119_v01.gpq`? Storing 1 date per file would result in too many files, but we can store a range of dates at least (e.g. annual, every two years, etc). Using a range-based `{MINDATE}_{MAXDATE}` naming scheme would:

- Allow someone to quickly find the relevant files to process based on a YYYYMMDD date just by looking at the filenames, and
- Be fairly compatible with any changes in the range of dates we want to store in one file, compared to say, a naming scheme with just one date like `{YYYYMMDD}`.
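For illustration, a minimal sketch of deriving such a name from the GeoDataFrame (assuming a datetime-typed `date` column as in the #73 schema):

```python
import geopandas as gpd


def date_range_filename(
    gdf: gpd.GeoDataFrame, mgrs_code: str, version: str = "v01"
) -> str:
    """Build a {MGRS}_{MINDATE}_{MAXDATE}_{VERSION}.gpq style filename."""
    mindate = gdf["date"].min().strftime("%Y%m%d")
    maxdate = gdf["date"].max().strftime("%Y%m%d")
    return f"{mgrs_code}_{mindate}_{maxdate}_{version}.gpq"
```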
> re: build number - I am thinking about the model training lifecycle. If we zoom out, so to speak, and consider Clay's roadmap over time w.r.t. model training, considering the availability of new Sentinel data products, etc., we will likely want to re-train the model periodically.
>
> Perhaps once a month or once a quarter we may want to refresh the training dataset with the latest data available from Sentinel-1/2, or perhaps train on new sensor data as well in addition to the Sentinel data products.
>
> We will want to be able to distinguish between old builds or old model versions vs new versions, because parts of our distributed system will need to continue working with old versions of data, embeddings, and map image tile assets while the next version is being processed.
>
> Hence we'll want something like a build number or model version to uniquely identify each along the way, similar to how OpenAI has tagged versions of GPT, etc.
>
> I assumed that `_v01` was meant to track our embedding file format (which has already changed quite a few times, if I'm not mistaken), or was that intended to track the model version?
Gotcha, so we want a way to track both model versions, and schema revisions to the embedding file itself. I was maybe thinking of using `v01` to track both, but we could probably do either:

- In the filename, have two parts like `{MODEL-VERSION}-{SCHEMA-REVISION}`, e.g. `_vit-1-abcde_v001.gpq`, where `vit-1-abcde` would mean Vision Transformer model 1, `abcde` signifies the hash of the trained model, and `v001` is the schema version (e.g. what the columns and metadata inside the parquet file look like). The three digits could almost be SemVer-like, so 001 would mean schema v0.0.1.
- Have just `{MODEL_VERSION}` in the filename, e.g. `_vit-1-abcde.gpq`, and store the schema revision number internally in the Parquet file's metadata.
What do you think?
Thanks for the quick back-of-the-napkin math. While there probably isn't a hard ceiling on what constitutes too big for our primary use case (vector ingestion service), remember too that folks might want to open and work with these GeoParquet files from within Jupyter notebooks or browser-based JavaScript environments with resource limitations.
I would say yes, let's include both `{MODEL-VERSION}` and `{SCHEMA-REVISION}` in the parquet metadata for sure. It might also make sense to include `{MODEL-VERSION}` in the filename, or rather, the pathname like so:

`{outfolder}/{MODEL-VERSION}/{MGRS_CODE}.gpq`

We could add `DATE` to the pathname as well, perhaps after MODEL is best:

`{MODEL-VERSION}/{DATE}/{MGRS_CODE}.gpq`
I like your idea around date range. Hopefully your code wouldn't have to do a table scan of all embeddings to determine MINDATE and MAXDATE... remember that could be 1.5G!
If a date range is available/doable, that could look like:

`{MODEL-VERSION}/{MINDATE}_{MAXDATE}/{MGRS_CODE}.gpq`
Yeah, that's looking good to me!
Which begs the question - what are the typical use cases we foresee w.r.t. exporting vector embeddings? I imagine during development you might hard-code things to only work with a subset such as one-tile, but how do we expect to do things in production on a regular basis?
Would export be something done on demand by an engineer/operator from the command line or API request, etc. specifying parameters for desired MGRS tile(s) or date range?
Also, an aside: are we generating the GeoParquet file to scratch disk space and then uploading to S3, or is it generated in memory? The reason I ask is that when I make "file name" suggestions I am specifically thinking of the S3 object's file name, to be clear. Feel free to name things on disk any which way you like or makes the most sense.
Hmm... or does it make more sense for `{MGRS}` to come before `{DATE}`?
Not sure how often someone will be browsing the S3 bucket we use for ingestion to look for a specific MGRS tile's embedding, but that might be a viable use case for data scientists who know the MGRS code but not the date range a priori. In that case, yeah, MGRS should probably come first left-to-right:

`{MODEL-VERSION}/{MGRS}/{MINDATE}_{MAXDATE}.gpq`
Sorry for the late reply. To summarize real quick, I think `{MGRS}_{MINDATE}_{MAXDATE}.gpq` would be best for the filename for now, and `{MODEL-VERSION}` will definitely be used at the folder level (i.e. a path like `{MODEL-VERSION}/{MGRS}_{MINDATE}_{MAXDATE}.gpq`). I'll need to have a think about whether to include the model version and schema revision in the filename too, but will do that in a separate PR.
> Which begs the question - what are the typical use cases we foresee w.r.t. exporting vector embeddings? I imagine during development you might hard-code things to only work with a subset such as one-tile, but how do we expect to do things in production on a regular basis?
>
> Would export be something done on demand by an engineer/operator from the command line or API request, etc. specifying parameters for desired MGRS tile(s) or date range?
>
> Also, an aside: are we generating the GeoParquet file to scratch disk space and then uploading to S3, or is it generated in memory? The reason I ask is that when I make "file name" suggestions I am specifically thinking of the S3 object's file name, to be clear. Feel free to name things on disk any which way you like or makes the most sense.
Just to be clear, the neural network model is not likely something we update frequently, since it costs a lot of time and money to train a Foundation Model on lots of data. As such, you might only see a new model version every 3 months, or every 6 months, depending on what budget there is.
As for generating the embeddings (aka the embedding factory), this can be done in multiple ways, depending on what vector database you've decided upon, and what downstream applications you have in mind:
- Currently we are processing each MGRS tile in batches, and doing a bulk export of embeddings to a GeoParquet file. For applications such as similarity search where you need a vector database of indexed embeddings to search over, a bulk batch method like this makes sense. E.g. if you want to find images of all the solar panels in the world.
- Alternatively, if you want to generate embeddings on demand via an API request, that may not even require setting up a vector database capable of storing millions of rows. For example, if you want to visualize embeddings for 1 MGRS tile over 5 years in a time-series, it might be better to set up something like HuggingFace's Inference API to do this, and just get a hundred rows out.
So my question is, are you looking at bulk scale embedding generation (100k+ rows), or small scale embedding generation (100s of rows)?
```python
# Output to a GeoParquet filename like {MGRS:5}_v{VERSION:2}.gpq
outpath = f"{outfolder}/{mgrs_code}_v01.gpq"
_gdf: gpd.GeoDataFrame = gdf.loc[mgrs_codes == mgrs_code]
_gdf.to_parquet(path=outpath, schema_version="1.0.0", compression="ZSTD")
```
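For context, a sketch of the surrounding loop this excerpt likely belongs to (the function wrapper and the `mgrs_code` column name are assumptions; the three lines above are from the diff):

```python
import geopandas as gpd


def save_per_mgrs(gdf: gpd.GeoDataFrame, outfolder: str = "data/embeddings"):
    # Split rows by MGRS code and save one GeoParquet file per tile
    mgrs_codes = gdf["mgrs_code"]  # column name assumed
    for mgrs_code in mgrs_codes.unique():
        # Output to a GeoParquet filename like {MGRS:5}_v{VERSION:2}.gpq
        outpath = f"{outfolder}/{mgrs_code}_v01.gpq"
        _gdf: gpd.GeoDataFrame = gdf.loc[mgrs_codes == mgrs_code]
        _gdf.to_parquet(path=outpath, schema_version="1.0.0", compression="ZSTD")
```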
Changed the Parquet compression from the default 'SNAPPY' to 'ZSTD' here. I ran some quick benchmarks using the code at https://github.com/jtmiclat/geoparquet-compression-benchmark/blob/main/benchmark.py, and it seems like ZSTD results in a smaller file size and slightly faster read speeds (decompression).
We can also tweak the compression ratio and other options (see https://arrow.apache.org/docs/13.0/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table), but this seems to be good enough for now, given that we'll be ingesting these GeoParquet files into some vector database like Postgres+pgvector.
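For instance, since geopandas forwards extra keyword arguments to `pyarrow.parquet.write_table`, a different ZSTD level could be tried like this (the level shown is just illustrative, not a recommendation):

```python
import geopandas as gpd

gdf = gpd.read_parquet(path="data/embeddings/32VLM_v01.gpq")
# compression_level is passed through to pyarrow.parquet.write_table
gdf.to_parquet(path="32VLM_v01_zstd9.gpq", compression="zstd", compression_level=9)
```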
Gonna merge this in first, and work on integrating with the new model (#47) next.
What I am changing

- Rename the output embeddings file from `embeddings_0.gpq` to a format like `{MGRS}_v{VERSION}.gpq`, as suggested at Send early sample of embeddings #35 (comment)
- Store the URL or path of the source GeoTIFF file (e.g. `s3://.../.../claytile_32VLM_20221119_v02_0200.tif`), for better provenance

How I did it
- In the LightningDataModule's datapipe, return a `source_url` for each GeoTIFF file being loaded
- In the LightningModule's `predict_step`, create a `source_url` column in the `geopandas.GeoDataFrame` (in addition to the previous three columns done at Save embeddings with spatiotemporal metadata to GeoParquet #73)

The `source_url` column is stored in the `string[pyarrow]` format (which will be the default in Pandas 3.0 per PDEP-10). Each row would store the embeddings for a single 512x512 chip, and the entire table could realistically store N rows for an entire MGRS tile (10000x10000) across different dates.
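A tiny self-contained sketch of that dtype choice (the geometry, CRS, and elided URL are illustrative; requires pyarrow to be installed):

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy GeoDataFrame standing in for one 512x512 chip's row
gdf = gpd.GeoDataFrame(geometry=[box(0, 0, 5120, 5120)], crs="EPSG:32632")
# Arrow-backed string dtype, rather than plain object dtype
gdf["source_url"] = pd.Series(
    data=["s3://.../claytile_32VLM_20221119_v02_0200.tif"],
    dtype="string[pyarrow]",
    index=gdf.index,
)
print(gdf.dtypes)  # source_url shows as string[pyarrow]
```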
TODO in this PR:

- Save the `source_url` column to the GeoParquet file
- Rename the output file to `{MGRS}_{VERSION}.gpq`

TODO in the future:
How you can test it

Run the prediction on a machine with access to the `s3://clay-tiles-02/02/` bucket (preferably in `us-east-1`, where the GeoTIFF files are stored); this should produce a `32VLM_v01.gpq` file under the `data/embeddings/` folder. See `python trainer.py predict --help` for the available options.
To load the embeddings from the geoparquet file:
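A minimal sketch, assuming the `32VLM_v01.gpq` output from the step above:

```python
import geopandas as gpd

# Read the GeoParquet file; geometry and CRS are handled natively
gdf = gpd.read_parquet(path="data/embeddings/32VLM_v01.gpq")
print(gdf.head())  # includes the 'source_url' column alongside the embeddings
```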
Related Issues
Follow-up to #73, addresses #35 (comment)