
thread-safety #13

Open
gauteh opened this issue Jan 13, 2020 · 17 comments

Comments

@gauteh

gauteh commented Jan 13, 2020

Does this project support thread-safe reading of HDF5 files?

@jgallagher59701
Member

jgallagher59701 commented Jan 13, 2020 via email

@gauteh
Author

gauteh commented Jan 13, 2020

Thanks. I've been experimenting with a lightweight, async Rust implementation of a DAP2 server. The ambition is to only support serving simple DAP, no catalog etc. Performance is similar to Hyrax for sequential reads (with no caching etc.), but it streams responses, so it does not require much memory (except for caching metadata currently). Concurrent data reads (tested with e.g. autocannon or wrk) suffer from the global locks necessary in the netCDF and HDF5 libs, while metadata is quite fast (70k requests/sec for DAS).

If hyrax has a thread-safe interface to HDF5 (at least for reads) that would greatly improve concurrent performance.

https://github.com/gauteh/dars
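To illustrate the bottleneck described above (a minimal sketch, not dars's actual code): with a non-thread-safe HDF5 library, every read from every request handler has to funnel through one global lock, so concurrent requests queue up. The `Hdf5File` struct here is a hypothetical stand-in for the C library handle:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a non-thread-safe HDF5 handle: all calls
// into the C library must be serialized behind one global lock.
struct Hdf5File {
    reads: usize,
}

impl Hdf5File {
    fn read_slice(&mut self) -> Vec<f64> {
        self.reads += 1;
        vec![0.0; 16] // placeholder for an actual dataset read
    }
}

fn main() {
    // One lock shared by every request handler: this is where
    // concurrent data reads serialize.
    let file = Arc::new(Mutex::new(Hdf5File { reads: 0 }));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let file = Arc::clone(&file);
            thread::spawn(move || file.lock().unwrap().read_slice().len())
        })
        .collect();

    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(total, 8 * 16);
    assert_eq!(file.lock().unwrap().reads, 8);
    println!("8 threads served {} values, one read at a time", total);
}
```

Metadata responses can be served from a cache without touching this lock, which is why they scale so much better than data reads.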

@jgallagher59701
Member

jgallagher59701 commented Jan 13, 2020 via email

@gauteh
Author

gauteh commented Jan 13, 2020

I think that it should be safe in that context. How would this relate to using the code as an AWS Lambda function?

I haven't looked much into that. But Rust works on AWS Lambda. And from briefly looking at this guide, they also use the tokio runtime, which is what I am using for main. So my handler routine could potentially be plugged in there (with some adaptation). I don't know how AWS Lambda functions access files. There is some global state stored in memory; I suspect this would be better kept in a separate service (Redis or something), letting the lambdas fetch from there. As far as I understand, AWS Lambda or e.g. Cloudflare are more geared towards a microservice setup?

This is currently about 1500 lines of code, so anything is possible.

If there is any way this could be used / incorporated in the OPeNDAP ecosystem, that would be very interesting.

@jgallagher59701
Member

jgallagher59701 commented Jan 13, 2020 via email

@gauteh
Author

gauteh commented Jan 14, 2020

> On Jan 13, 2020, at 12:39, Gaute Hope @.***> wrote: I think that it should be safe in that context. How would this relate to using the code as an AWS Lambda function? I haven't looked much into that. But Rust works on AWS Lambda. And from briefly looking at this guide https://aws.amazon.com/blogs/opensource/rust-runtime-for-aws-lambda/ they also use the tokio runtime, which is what I am using for main. So my handler routine could potentially be plugged in there (with some adaptation). I don't know how AWS Lambda functions access files. There is some global state stored in memory; I suspect this would be better kept in a separate service (Redis or something), letting the lambdas fetch from there. As far as I understand, AWS Lambda or e.g. Cloudflare are more geared towards a microservice setup?

Well, I’ve been thinking of accessing data stored in S3. We have code to do that.

> This is currently about 1500 lines of code, so anything is possible. If there is any way this could be used / incorporated in the OPeNDAP ecosystem, that would be very interesting.

I wonder what would be the best way? I am looking for a student intern and it could be a great project, with the caveat that I know zero Rust...

Absolutely! I did not start with Rust too long ago, but I think it is really well suited for this type of service. Especially for the safety in concurrency and memory, which tends to be very difficult to get right in C/C++ when writing this type of code, while still offering performance similar to C++. Go is probably in the same niche, but less safe with respect to memory and race conditions.

I think that to support both an AWS Lambda setup and more traditional load-balanced servers, things need to be split up a bit more. But since AWS Lambda also uses tokio for async functions, this should be possible to do in an efficient and clean way.

@magnusuMET

Do you require a thread-safe hdf5 installation? If not, how do you synchronize accesses when calling into hdf5?

@jgallagher59701
Member

jgallagher59701 commented Jan 15, 2020 via email

@gauteh
Author

gauteh commented Jan 17, 2020

> Do you require a thread-safe hdf5 installation? If not, how do you synchronize accesses when calling into hdf5?

> The OPeNDAP server only reads HDF5, so there’s no need to synchronize the accesses.

If I understand correctly, you have your own implementation of an HDF5 reader? One which is reasonably thread-safe for reads (e.g. no global buffers)? This is very useful, since the official HDF5 library is not thread-safe even for reads; it is relatively easy to crash it or corrupt data by stress-testing it.

The official HDF5 library can be compiled with a global lock, making it thread-safe only by forcing all access to be sequential. This does not really help performance: it is the same synchronization you would otherwise have to implement (relatively easily) yourself as a user of the non-thread-safe HDF5 (or netCDF) library. It does not allow concurrent access; it just moves the synchronization into the library.
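To illustrate the difference (a sketch with a toy struct, not dars or the actual reader): if a reader really is safe for concurrent reads, a `std::sync::RwLock` admits many readers at once, whereas with the official library even reads must hold an exclusive lock, so nothing runs in parallel:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical pure reader with no shared mutable state: concurrent
// reads are safe, so a read-write lock never blocks readers on readers.
struct PureReader {
    data: Vec<f64>,
}

impl PureReader {
    fn value_at(&self, i: usize) -> f64 {
        self.data[i]
    }
}

fn main() {
    let reader = Arc::new(RwLock::new(PureReader {
        data: vec![1.0, 2.0, 3.0, 4.0],
    }));

    // All four threads can hold the read lock simultaneously.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let reader = Arc::clone(&reader);
            thread::spawn(move || reader.read().unwrap().value_at(i))
        })
        .collect();

    let sum: f64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    assert_eq!(sum, 10.0);
}
```

With the globally locked C library, the `RwLock` above would have to be a `Mutex`, which is exactly the library-level synchronization described in the comment.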

@jgallagher59701
Member

jgallagher59701 commented Jan 17, 2020 via email

@gauteh
Author

gauteh commented Jan 23, 2020

That would be very useful, if you find a student interested in dars. Is the primary use case an AWS-deployed service using S3? In our case this would be very useful as well, but a more traditional setup (file system) should also be supported.

Currently, metadata is cached, but XDR/DODS is not. That might not be necessary with a multi-threaded HDF5 reader, but otherwise some caching mechanism has to be supported there as well (I could not really find a good proxy to put in between; it might be possible, but it would not be able to make use of overlapping data requests). With a multi-threaded HDF5 reader, the main performance gains from caching would come from memory caching (Redis / memcached), from not having to convert to XDR, and possibly from avoiding file-system issues if the data store is slow. In any case, to be able to support these three use cases:

  • s3
  • traditional without caching (e.g. a small server only requiring a single instance of the main server)
  • traditional with caching

things need to be split up into a library in a sensible way, so that two or three targets can be built from it. I think that trying to fit all three into one target will result in branches so independent that they are essentially different programs. Do you have any thoughts on this?
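One way to get those targets from a single code base is Cargo feature flags; a hypothetical sketch of the manifest (feature and dependency names are assumptions, not dars's actual Cargo.toml):

```toml
# Hypothetical Cargo.toml for a dars library crate: one core DAP2
# implementation, with backends selected per deployment target.
[features]
default = ["fs"]
fs = []                    # traditional file-system backend, no cache
s3 = ["dep:rusoto_s3"]     # S3-backed reads (dependency name assumed)
cache = ["dep:redis"]      # optional memory-cache layer for responses
```

Each binary target (plain server, cached server, Lambda handler) would then enable only the features it needs, sharing the core DAP logic.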

@gauteh
Author

gauteh commented Jan 25, 2020

I've looked at the DMR++ module, trying to understand how it is built up, and I am now appropriately confused. One thing I was wondering about: it seems that everything goes through libcurl; from looking at dmrpp_module/data/README.md it seems that local files are also accessed through libcurl using file:// URLs?

The same README.md also mentions something about 2 or 3 times smaller files; does that mean that the DMR++ files also contain data? It still seems like a large size for metadata and chunk maps (is the map file the DMR++ file?).

It would maybe make sense for me to support DMR++ files.

@ndp-opendap
Contributor

ndp-opendap commented Jan 25, 2020

> I've looked at the DMR++ module, trying to understand how it is built up, and I am now appropriately confused. One thing I was wondering about: it seems that everything goes through libcurl; from looking at dmrpp_module/data/README.md it seems that local files are also accessed through libcurl using file:// URLs?

  • Yes, it uses libcurl for access.
  • Yes, file:// URLs work for range access of local files. This might be replaced with specific file-pointer-based code in an effort to improve performance, but since curl just "does it" we rolled with it in the short term.
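The "file-pointer-based code" alternative is essentially a seek-and-read of each chunk's byte range. A minimal sketch in Rust (function name and file path are hypothetical, not from the dmrpp module):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

// Read `n` bytes starting at `offset`, the same byte-range access a
// curl range request on a file:// URL performs.
fn read_range(path: &str, offset: u64, n: usize) -> std::io::Result<Vec<u8>> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; n];
    f.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Demo file standing in for an hdf5 chunk store.
    let path = std::env::temp_dir().join("dmrpp_demo.bin");
    File::create(&path)?.write_all(b"0123456789")?;

    // Fetch 4 bytes starting at offset 3, like one chunk-map entry.
    let chunk = read_range(path.to_str().unwrap(), 3, 4)?;
    assert_eq!(chunk, b"3456");
    Ok(())
}
```

For remote (http/s3) chunks the same offset/length pair becomes an HTTP Range header, which is why a single curl code path can cover both cases.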

> The same README.md also mentions something about 2 or 3 times smaller files; does that mean that the DMR++ files also contain data? It still seems like a large size for metadata and chunk maps (is the map file the DMR++ file?).

The dmr++ files contain all of the source file's syntactic and semantic metadata in addition to the chunk-maps. In many cases the semantic metadata of NASA data products is quite large. No actual data values are stored in the dmr++ files.

Nathan
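For a rough picture of what such a file holds, a simplified, hypothetical dmr++ fragment for one variable might pair the DAP metadata with a chunk map along these lines (element and attribute names here are illustrative, based on the description above, not copied from the actual schema):

```xml
<Float32 name="sst">
  <Dim name="/lat"/>
  <Dim name="/lon"/>
  <dmrpp:chunks compressionType="deflate">
    <!-- byte offset and length of each chunk in the source hdf5 file -->
    <dmrpp:chunk offset="40960" nBytes="16384" chunkPositionInArray="[0,0]"/>
    <dmrpp:chunk offset="57344" nBytes="16384" chunkPositionInArray="[0,512]"/>
  </dmrpp:chunks>
</Float32>
```

The data values themselves stay in the source file; the dmr++ only records where to fetch them from.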

@gauteh
Author

gauteh commented Jan 25, 2020

Thanks, that makes sense.

@ndp-opendap
Contributor

I should have pointed out that this arrangement allows the server to construct all of the DAP2/4 metadata responses (.ddx, .dds, .dmr, etc) without interrogating the source hdf5/nc4 file.

@gauteh
Author

gauteh commented Jan 27, 2020 via email

@jgallagher59701
Member

jgallagher59701 commented Jan 28, 2020 via email
