
optional disable download verification for prepro task #974

Merged: 1 commit into OGGM:master on Mar 14, 2020

Conversation

matthiasdusch (Member)

The download verification for large files takes an absurd amount of time:
Calculating the hash for the 4.7 GB Alaska DEM takes about 12 seconds on the cluster. The Alaska region has more than 27k glaciers, and because of the download lock the hash calculation is not multi-processed. That's almost 4 days of calculating the same hash over and over again :-D (I think this is as close to bitcoin mining as I'll ever get...)
Worst of all, it exceeded the cluster time limit, my process failed, and it took me a while to figure this out...
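
For reference, a minimal sketch of what each verification roughly amounts to, assuming the check is a plain chunked SHA-256 over the whole file (the exact algorithm and chunk size used by OGGM are assumptions here):

```python
import hashlib

def file_hash(path, chunk_size=8 * 1024 * 1024):
    """Hash a file in chunks so a 4.7 GB DEM never has to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# ~12 s per call on the cluster, and the check runs inside the (serial)
# download lock, once per glacier:
# 27000 glaciers * 12 s ~= 324000 s ~= 3.75 days for the same file.
```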

In this PR I just added an argument to suppress the verification for the preprocessing task.
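
For illustration only, a hedged sketch of how such an opt-out could be wired from the command line into a config parameter; the flag and parameter names below are placeholders, not necessarily what this PR uses:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description='prepro task (sketch)')
    parser.add_argument('--disable-dl-verify', action='store_true',
                        help='skip hash verification of downloaded files')
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # The downloader would consult this parameter before hashing a file
    # (placeholder key, standing in for whatever the PR actually sets):
    params = {'dl_verify': not args.disable_dl_verify}
    return params

if __name__ == '__main__':
    print(main())
```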

But I think we should avoid such large files whenever possible. @fmaussion, do you think it would be possible to split this DEM similarly to what you did with REMA? Or is there a license/attribution issue?

And @TimoRoth, do you think there is a reasonable way to avoid calculating the same hash value over and over again?

@fmaussion (Member)

> Worst of all, it exceeded the cluster time limit, my process failed, and it took me a while to figure this out...

Very sorry about that. I went through the same process a while ago, and that's one of the reasons why cutting the files into parts is a good idea. That being said, I managed to create topo files for Alaska based on this DEM on the cluster, so I wonder why it takes that long in your case.

> do you think it would be possible to split this DEM similarly to what you did with REMA? Or is there a license/attribution issue?

No, it's more the work it takes - I'd have to create a custom shapefile to cut them.

> do you think there is a reasonable way to avoid calculating the same hash value over and over again?

The main issue is that we don't know beforehand which files we will have to check, and the worker processes currently have no way to pass information back to the main process. Maybe this could work? https://stackoverflow.com/questions/6832554/multiprocessing-how-do-i-share-a-dict-among-multiple-processes The idea would be to check whether a file has already been verified in this session and skip the hash check if so.
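
A minimal sketch of that shared-dict idea, assuming a `multiprocessing.Manager` dict that records which files have already been verified in this session (function and variable names are illustrative, not OGGM's API):

```python
import hashlib
import multiprocessing as mp

def _sha256(path, chunk_size=8 * 1024 * 1024):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def verify_once(path, expected, verified, lock):
    """Hash `path` only if no process has verified it in this session yet."""
    if path in verified:            # DictProxy supports membership tests
        return verified[path]
    ok = _sha256(path) == expected
    with lock:
        verified[path] = ok
    return ok

if __name__ == '__main__':
    manager = mp.Manager()
    verified = manager.dict()       # shared across all worker processes
    lock = manager.Lock()
    # Each worker would be handed `verified` and `lock` (e.g. via a pool
    # initializer) and call verify_once() instead of re-hashing every time.
```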

@matthiasdusch (Member, Author)

> Very sorry about that. I went through the same process a while ago, and that's one of the reasons why cutting the files into parts is a good idea. That being said, I managed to create topo files for Alaska based on this DEM on the cluster, so I wonder why it takes that long in your case.

The hash was only updated to include this DEM a couple of days ago. That was the most confusing part: I did the DEM processing with the default resolution first, which went without problems. Then I ran it again with increased resolution and the Alaska DEM took forever, so at first I thought it had something to do with the resolution...

> No, it's more the work it takes - I'd have to create a custom shapefile to cut them.

I could also give it a try and use the CopernicusDEM shapefile to cut it; it has 1x1 degree polygons.
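
Roughly what that could look like with rasterio and geopandas; the file names are placeholders and the tile polygons would first be reprojected into the DEM's own CRS:

```python
import geopandas as gpd
import rasterio
from rasterio.mask import mask
from shapely.geometry import box

tiles = gpd.read_file('copernicus_tiles.shp')      # placeholder path
with rasterio.open('alaska_dem.tif') as src:       # placeholder path
    tiles = tiles.to_crs(src.crs)                  # the Alaska DEM has its own projection
    tiles = tiles[tiles.intersects(box(*src.bounds))]  # keep only tiles covering the DEM
    for i, geom in enumerate(tiles.geometry):
        out, transform = mask(src, [geom], crop=True)
        meta = src.meta.copy()
        meta.update(height=out.shape[1], width=out.shape[2],
                    transform=transform)
        with rasterio.open(f'alaska_dem_tile_{i:04d}.tif', 'w', **meta) as dst:
            dst.write(out)
```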

> The main issue is that we don't know beforehand which files we will have to check, and the worker processes currently have no way to pass information back to the main process. Maybe this could work? https://stackoverflow.com/questions/6832554/multiprocessing-how-do-i-share-a-dict-among-multiple-processes The idea would be to check whether a file has already been verified in this session and skip the hash check if so.

Yes, sharing the info between the processes seems a bit tricky. Maybe smaller files are the easier and more efficient way to go (also in general). Smaller DEM tiles won't be used over and over again anyway.

matthiasdusch merged commit 2cedac3 into OGGM:master on Mar 14, 2020
matthiasdusch deleted the prepro branch on March 14, 2020 at 00:26
@fmaussion (Member)

> I could also give it a try and use the CopernicusDEM shapefile to cut it; it has 1x1 degree polygons.

The Alaska DEM has its own projection - but I have an idea of how to do that with Salem, so I'll try it out.

> Yes, sharing the info between the processes seems a bit tricky.

Using the Manager object from the SO post above seems promising, but it will require some testing.
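
One way to do that testing, sketched under the same assumptions as the snippet above: hand the Manager proxies to the workers through a pool initializer (names here are again illustrative only):

```python
import multiprocessing as mp

_VERIFIED = None
_LOCK = None

def _init_worker(verified, lock):
    # Store the Manager proxies as module globals inside each worker process.
    global _VERIFIED, _LOCK
    _VERIFIED, _LOCK = verified, lock

def work(path):
    # A real worker would hash `path` and compare it to the expected value;
    # here we only record whether the file was already seen in this session.
    with _LOCK:
        already_checked = path in _VERIFIED
        _VERIFIED[path] = True
    return path, already_checked

if __name__ == '__main__':
    manager = mp.Manager()
    verified, lock = manager.dict(), manager.Lock()
    with mp.Pool(2, initializer=_init_worker, initargs=(verified, lock)) as pool:
        print(pool.map(work, ['a.tif', 'b.tif', 'a.tif']))
    # The second 'a.tif' will typically report already_checked=True,
    # so its hash calculation would be skipped.
```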
