Performance issues with large number of datasets #106

Open
mtsales opened this issue Feb 16, 2018 · 24 comments

@mtsales

mtsales commented Feb 16, 2018

I have been using ncWMS on several occasions and I'm very pleased with it. Thanks for the great work!
Now I have a project with 12 000 datasets and I'm facing some issues:
1 - Loading after a server restart takes more than 1 hour
2 - Getting the image of a dataset is too slow (even for a small dataset)

I have looked at the dynamic datasets but could not figure out how it works so I'm not sure this solves the issue.
In any case I was wondering if the issues above have a solution.

For 1, I was wondering whether it would be possible to persist the catalogue after the first load of config.xml, including the update time of each dataset. On restart, the server would load the persisted catalogue and only reload the datasets whose files have a modification time later than the last update recorded in the catalogue.
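
To make the proposal concrete, here is a minimal Python sketch of the check it describes. This is not an existing ncWMS feature or API: the JSON catalogue file, the function names and the mapping from dataset IDs to files are all assumptions made for illustration.

```python
import json
import os
from pathlib import Path


def load_catalogue(path):
    """Return {dataset_id: last_update_epoch_seconds}, or {} on first start."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_catalogue(path, catalogue):
    """Persist the catalogue so that the next restart can reuse it."""
    Path(path).write_text(json.dumps(catalogue))


def datasets_needing_reload(catalogue, dataset_files):
    """List dataset IDs that are new or have a file modified since the last update.

    dataset_files maps each dataset ID to the data files behind it
    (a hypothetical structure, not taken from config.xml).
    """
    stale = []
    for dataset_id, files in dataset_files.items():
        last_update = catalogue.get(dataset_id)
        newest = max((os.path.getmtime(f) for f in files), default=0.0)
        if last_update is None or newest > last_update:
            stale.append(dataset_id)
    return stale
```

Only the IDs returned by `datasets_needing_reload` would need their metadata re-read; everything else could be served from the persisted catalogue.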

Regarding 2, I noticed in the code that a hash table is used to look up dataset IDs by name, but given how slow rendering is, I was wondering whether somewhere in the code a loop over the datasets is used instead of the hash table.

If solving 1 and 2 is not possible, do you think dynamic datasets will solve my problem? If so, could you provide guidance on how to set them up with NetCDF files located in various folders on Windows?

@guygriffiths
Contributor

guygriffiths commented Feb 16, 2018

It would certainly be possible to persist the catalogue data between restarts, but I don't think it's a trivial thing to do. I'd certainly consider it if we can't find an alternative solution, although I don't have a great deal of time for ncWMS development at the moment.

The slowness of getting an image is interesting and I don't know what could be causing it. How large are your datasets? I'd love to get a local setup matching yours so I could investigate further, but I'm guessing that that would not be practical unless each dataset is very small.

However, I do think that dynamic datasets could solve your problem - they are designed for the use case where you have a directory structure containing a vast number of datasets and want them all to be accessible without individual configuration. You configure each root directory of datasets with 3 parameters:

  • Alias: This is similar to the dataset ID. It should be a short unique identifier.
  • Location: This is the root directory of the datasets you wish to expose. For example, I have a dynamic dataset with the alias local configured with the location /home/guy/Data/release_test_data/
  • Match regex: This is a regular expression which you can use to filter out which datasets are available. Leaving it as its default of .* gives no filtering (i.e. all datasets are available)

To access the dynamic datasets, you need to set the DATASET parameter to the Alias of the dynamic dataset plus the path of the dataset underneath the location. For example, on my machine /home/guy/Data/release_test_data/ has the structure:

├── 01-xyzt
│   └── synthetic_rectilinear_data.nc
├── 02-xyz
│   └── foam_one_degree-2011-01-01.nc
├── 03-xyt_aggregation
│   ├── 20100715-ostia.nc
│   ├── 20100716-ostia.nc
│   ├── 20100717-ostia.nc
│   └── 20100718-ostia.nc
...

So in GetMap requests, I can set the DATASET parameter to local/01-xyzt/synthetic_rectilinear_data.nc to get an image from that NetCDF file. Or if I want the aggregated data, I could use DATASET=local/03-xyt_aggregation/*.

More often I will use the Godiva3 interface to view the datasets, by simply passing the DATASET parameter to it - e.g. for a local server: http://localhost:8080/ncWMS2/Godiva3.html?DATASET=local/03-xyt_aggregation/*. That will produce a menu containing just a single dataset.
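
For anyone scripting this, a small Python sketch of how the dataset ID and the Godiva3 link above are put together. The helper names are made up for the example; only the alias, the path and the URL shape come from this comment.

```python
from urllib.parse import urlencode


def dynamic_dataset_id(alias, relative_path):
    """Join the alias and the path below the configured location."""
    return f"{alias}/{relative_path.lstrip('/')}"


def godiva3_url(base_url, dataset_id):
    """Build a Godiva3 link; '/' and '*' are left unescaped, as in the URL above."""
    query = urlencode({"DATASET": dataset_id}, safe="/*")
    return f"{base_url.rstrip('/')}/Godiva3.html?{query}"


dataset_id = dynamic_dataset_id("local", "03-xyt_aggregation/*")
print(godiva3_url("http://localhost:8080/ncWMS2", dataset_id))
```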

Dynamic datasets are slower than configured datasets, since the metadata needs to be read when each dataset is accessed. It is also not possible to list all datasets up front, so some knowledge of the data structure is necessary. For a full dynamic data catalogue, I would recommend using THREDDS. It uses the EDAL libraries to provide WMS, although the stable version (4.6) uses an old version which may lack some features you want.

@mtsales
Author

mtsales commented Feb 16, 2018

Thanks for your prompt reply.
The datasets vary in size, from a few KB to some large ones of 60-600 MB. But getting an image takes a long time even for the small ones. The total size of the setup makes it impossible to share.

I'll give it another try with the dynamic datasets and see if that solves the problem and we can find a way of keeping the knowledge of the data structure.

Besides the DATASET parameter, the LAYERS parameter is still required. What should the value for LAYERS be?

@guygriffiths
Contributor

Apologies, DATASET is only required for specifying it through Godiva, but not for general WMS requests. Instead you just specify the layer as dataset/variable, e.g.:
LAYERS=local/01-xyzt/synthetic_rectilinear_data.nc/temperature.

However, DATASET is a convenient way to specify the dynamic dataset, in which case LAYERS just needs to contain the variable ID within that dataset. So for example, a GetMap request could have:
DATASET=local/01-xyzt/synthetic_rectilinear_data.nc&LAYERS=temperature - this is exactly equivalent to the above.
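
As a sketch, the two equivalent forms can be built like this in Python. The remaining parameters are the standard WMS 1.3.0 GetMap ones with placeholder values, and the `/ncWMS2/wms` endpoint path is an assumption based on the local server URL mentioned earlier; adjust both to your setup.

```python
from urllib.parse import urlencode

# Standard WMS 1.3.0 GetMap parameters; the values are placeholders.
COMMON = {
    "SERVICE": "WMS",
    "VERSION": "1.3.0",
    "REQUEST": "GetMap",
    "CRS": "CRS:84",
    "BBOX": "-180,-90,180,90",
    "WIDTH": "512",
    "HEIGHT": "256",
    "FORMAT": "image/png",
    "STYLES": "",
}

ENDPOINT = "http://localhost:8080/ncWMS2/wms"  # assumed endpoint path


def getmap_url(endpoint, **layer_params):
    """Combine the common GetMap parameters with DATASET/LAYERS."""
    params = {**COMMON, **layer_params}
    query = urlencode(params, safe="/,:")
    return endpoint + "?" + query


# Form 1: the layer name carries the dynamic dataset path and the variable.
full_layer = getmap_url(
    ENDPOINT,
    LAYERS="local/01-xyzt/synthetic_rectilinear_data.nc/temperature",
)

# Form 2: DATASET names the dynamic dataset, LAYERS only the variable.
split_form = getmap_url(
    ENDPOINT,
    DATASET="local/01-xyzt/synthetic_rectilinear_data.nc",
    LAYERS="temperature",
)
```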

@mtsales
Author

mtsales commented Feb 19, 2018 via email

@guygriffiths
Contributor

I haven't tried it on Windows, but the first thing I'd try is changing the backslashes to slashes. Let me know if that doesn't work and I'll try and get a Windows setup to see what I can do about debugging it.

@mtsales
Author

mtsales commented Feb 19, 2018

I made it work. My mistake, sorry. There was a missing colon after C when specifying the location
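
For reference, a small Python sketch that would have caught this before the location was pasted into the admin page. It is only an illustration: the function name is made up, and whether ncWMS actually needs forward slashes on Windows was not confirmed in this thread.

```python
from pathlib import PureWindowsPath


def normalise_location(location):
    """Return a Windows location written with forward slashes.

    Raises ValueError when no drive (or UNC share) is present, which is
    what a forgotten colon after the drive letter looks like.
    """
    path = PureWindowsPath(location)
    if not path.drive:
        raise ValueError(f"no drive letter found in {location!r}")
    return path.as_posix()
```

`PureWindowsPath` parses Windows paths on any operating system, so the check can run on a different machine from the server.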

I'll test this and let you know if we can use it. Thanks for all the help so far

@mtsales
Author

mtsales commented Feb 22, 2018

It seems we can make it work using dynamic datasets, but we will be developing a proxy that converts layer names (the traditional way) to the folder/file structure supported by dynamic datasets.
Implementing the improvements to support a large number of normal datasets would still be relevant in the future.
Thanks a lot for your help and support
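
The core of such a proxy could be a lookup along these lines. This is a hedged Python sketch, not the proxy that was actually built: the function name, the lookup table and the example IDs and paths are hypothetical, and it assumes traditional layer names have the form datasetId/variable.

```python
from typing import Mapping


def to_dynamic_layer(layer: str, dataset_paths: Mapping[str, str], alias: str = "local") -> str:
    """Translate 'datasetId/variable' into 'alias/relative/path.nc/variable'.

    dataset_paths maps each traditional dataset ID to the file path below
    the dynamic dataset location (a hypothetical table kept by the proxy).
    """
    dataset_id, sep, variable = layer.partition("/")
    if not sep or not variable:
        raise ValueError(f"expected 'datasetId/variable', got {layer!r}")
    relative_path = dataset_paths[dataset_id]  # KeyError for unknown datasets
    return f"{alias}/{relative_path}/{variable}"


# Hypothetical example entry
paths = {"rainfall_2018": "some_folder/rainfall_2018.nc"}
print(to_dynamic_layer("rainfall_2018/precipitation", paths))
```

The proxy would rewrite the LAYERS parameter of each incoming request this way and forward everything else unchanged.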

@mtsales
Author

mtsales commented Mar 21, 2018

My datasets are updated frequently, but that is not reflected in the dynamic datasets, resulting in an error when trying to get data added after the first access to the file. Is there a way to update the metadata for dynamic datasets without restarting the webapp?

@guygriffiths
Contributor

@mtsales - How are the datasets added to? The problem is that dynamic datasets are cached, and I have implemented some code which allows you to configure this cache and request that it is emptied, but there will still be issues if you are defining the dataset by a glob expression and adding files to it.

@mtsales
Author

mtsales commented Apr 3, 2018

The dynamic datasets are defined with the "local" alias pointing to the topmost folder containing data and using the .* regex.
Due to case-sensitivity issues, when requesting datasets or layers we use a glob expression to match cases like: local/[dD][aA][tT][aA]/[pP][eE][rR][sS][iI][aA][nN].[nN][cC]
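
A glob of this shape can be generated mechanically. The following Python sketch shows one way to do it; the function names are illustrative and only ASCII letters are expanded.

```python
def case_insensitive_glob(path):
    """Replace each ASCII letter with a [lowerUPPER] character class."""
    parts = []
    for ch in path:
        if ch.isascii() and ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            parts.append(ch)  # digits, '/', '.', '_' and so on stay as they are
    return "".join(parts)


def dynamic_dataset_glob(alias, relative_path):
    """Keep the alias literal and make only the path case-insensitive."""
    return f"{alias}/{case_insensitive_glob(relative_path)}"


print(dynamic_dataset_glob("local", "data/persian.nc"))
```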

@guygriffiths
Contributor

I think that should be fine - there would be a potential issue if you were using a glob expression to aggregate multiple files, but using it for a single file won't interfere with emptying the dynamic dataset cache.

@mtsales
Author

mtsales commented Apr 11, 2018

Great. How can the cache for a dataset be emptied?

@guygriffiths
Contributor

In the admin interface, just under the dynamic datasets configuration, there is a box to configure cache settings for dynamic datasets. One of the options is a checkbox labelled "Empty cache". You'd need to check it and click the save button.

Note that this code is implemented in the develop branch. It'll be in the next release, or you can checkout develop and build from source.

@mtsales
Author

mtsales commented Apr 11, 2018

Thanks. I'll wait for the next release. Is the metadata for a dataset kept in the cache until the timestamp of a file changes, and re-read when a file is newer than the cached one? Or do we always have to empty the cache manually?

@guygriffiths
Contributor

It will need to be manually emptied. If you are updating datasets very regularly, you might be better just switching it off.

@mtsales
Author

mtsales commented Apr 27, 2018

any tentative date for the next release?

@guygriffiths
Contributor

Not currently. I'm working on some changes to EDAL for a separate project, and I'll do a release for that at some point in the next month or so.

@mtsales
Author

mtsales commented May 4, 2018

Thanks for the info. I'll wait for that. I have noticed that sometimes, when using dynamic datasets, ncWMS complains about a missing file (see errors below).
The files exist and are valid (I can load them using normal datasets). This is random: after waiting some time, the datasets that gave problems work correctly. Any clue as to what might be happening?

2018-05-04 09:02:53 WARN WmsServlet:2742 - Wms Exception caught: "Requested menu for dataset: local/Volta/1b_TRMM/[tT][rR][mM][mM]_2018.[nN][cC] which does not exist on this server" from:uk.ac.rdg.resc.edal.wms.WmsServlet:1044

2018-05-04 09:02:53 WARN CdmUtils:439 - Using relative path for a dataset. This may cause unpredictable or platform-dependent behaviour. The use of absolute paths is recommended

2018-05-04 09:02:53 WARN WmsServlet:2742 - Wms Exception caught: "The layer local/Volta/1b_TRMM/[tT][rR][mM][mM]_2018.[nN][cC]/ was not found on this server" from:uk.ac.rdg.resc.edal.wms.util.WmsUtils:262

@guygriffiths
Contributor

I'll have a look into it and see if I can replicate it. What is local configured as (specifically the location)?

@mtsales
Author

mtsales commented May 14, 2018

local corresponds to E:\FloodDraughtPortal on a Windows machine, which contains a hierarchy of subfolders with a large number of datasets. I could not reproduce this using normal datasets, but errors occur very frequently with dynamic datasets. I removed the glob expressions and still run into problems randomly.

I got a bit further with the investigation and now I also have a stack trace (see attached)

I'm also attaching the nc file that caused this exception

DynamicDatasetsError.zip

@guygriffiths
Contributor

The fact that it's warning you about using a relative path suggests that it's failing to detect that you're referring to a dynamic dataset, but I haven't been able to reproduce this problem. I'll be doing a new release of EDAL/ncWMS later today. Assuming you see the same issue with the new version, could you please post the whole section of the logs from when the error occurs (not just the stack trace part, but anything within, say, a minute of it happening)?

@mtsales
Author

mtsales commented May 16, 2018

I tested the latest release and the problem persists.

Please note I can no longer reproduce the error:
2018-05-04 09:02:53 WARN WmsServlet:2742 - Wms Exception caught: "Requested menu for dataset: local/Volta/1b_TRMM/[tT][rR][mM][mM]_2018.[nN][cC] which does not exist on this server" from:uk.ac.rdg.resc.edal.wms.WmsServlet:1044

2018-05-04 09:02:53 WARN CdmUtils:439 - Using relative path for a dataset. This may cause unpredictable or platform-dependent behaviour. The use of absolute paths is recommended

2018-05-04 09:02:53 WARN WmsServlet:2742 - Wms Exception caught: "The layer local/Volta/1b_TRMM/[tT][rR][mM][mM]_2018.[nN][cC]/ was not found on this server" from:uk.ac.rdg.resc.edal.wms.util.WmsUtils:262

since I have stopped using glob expressions.

Now what I get is random read errors. For the same files, sometimes it reads and renders correctly, sometimes it throws exceptions as per the attached log files. Usually there are 21 GetMap requests for each dataset, but please note a peculiar thing in lines 258 to 264. There are only 7 errors for this dataset, and the image was only partially rendered due to these 7 errors.

These logs were produced using version 2.3.1 and not the latest. With the latest release the exception is different. Please see ncwms_latest.log

ncwms_latest.log

logs.zip

@mtsales
Author

mtsales commented Jun 13, 2018

Any news on this? Is more information needed?

@guygriffiths
Contributor

Sorry, I haven't had time to work on this, I'm currently very busy with a number of other projects. It's still on my todo list, but ncWMS is fairly low priority at the moment.
