CUAHSI WDC catalog API search enhancements #10

Open
emiliom opened this issue Mar 20, 2018 · 8 comments

Comments

@emiliom
Member

emiliom commented Mar 20, 2018

Goal: Find WDC sites & data series in larger, more useful AoIs (areas of interest).

Revisit the choice of catalog search API requests and explore newer ones that may be faster, more flexible, and more effective.

  • Problem: can only search AoI < 1500 km²
  • Arscott: need to do at least HUC10

Background / research

@emiliom
Member Author

emiliom commented Mar 20, 2018

We will explore the new web services at CUAHSI: http://hiscentral.cuahsi.org/webservices/hiscentral.asmx

  • The new CUAHSI WDC API we'll be looking at is GetSeriesMetadataCountOrData. Here's the info about it copied directly from http://hiscentral.cuahsi.org/webservices/hiscentral.asmx
    • Provides information about metadata stored in the catalog. Typically used to search the catalog. It can return the following info (note: the return can be defined by providing the appropriate parameters in the request. The return of this request cannot exceed 25,000 timeseries):
      1. The count of timeseries that match the provided parameters.
      2. The statistics for the distribution of all facets for timeseries that match the provided parameters, e.g., how many timeseries have the datatype 'average' or the keyword 'precipitation'.
      3. The complete set of all metadata records for timeseries that match the provided parameters.
  • From that same service page, here are the parameters the API request accepts: getData, getFacetOnCV, xmin, xmax, ymin, ymax, sampleMedium, dataType, valueType, generalCategory, conceptKeyword, networkIDs, beginDate, endDate
  • The API request getSeriesCatalogInBoxPaged has been deprecated. I think this was a new API (2017?) that we never used, but it's now deprecated.
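
Given the 25,000-timeseries cap and the count-only mode described above, one possible pattern is to ask for the count first and subdivide the box when the count is too high. The sketch below illustrates that idea only; `count_series` is a hypothetical caller-supplied function (it would issue a count-only GetSeriesMetadataCountOrData request), and the quadrant-splitting strategy is an assumption, not something the service page prescribes.

```python
MAX_SERIES = 25000  # cap stated on the service page


def split_box(xmin, ymin, xmax, ymax):
    """Split a lon/lat box into four equal quadrants."""
    xmid = (xmin + xmax) / 2.0
    ymid = (ymin + ymax) / 2.0
    return [
        (xmin, ymin, xmid, ymid),
        (xmid, ymin, xmax, ymid),
        (xmin, ymid, xmid, ymax),
        (xmid, ymid, xmax, ymax),
    ]


def plan_requests(box, count_series, max_series=MAX_SERIES, max_depth=6):
    """Return a list of boxes whose series counts fit under the cap.

    count_series(xmin, ymin, xmax, ymax) is hypothetical: it stands in for
    a count-only request and must return an integer.
    max_depth stops the recursion if a tiny box still exceeds the cap.
    """
    if max_depth == 0 or count_series(*box) <= max_series:
        return [box]
    boxes = []
    for quadrant in split_box(*box):
        boxes.extend(plan_requests(quadrant, count_series, max_series, max_depth - 1))
    return boxes
```

Series that sit exactly on a shared quadrant edge could be returned twice, so results from the sub-boxes would likely need de-duplication.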

@emiliom
Member Author

emiliom commented Mar 23, 2018

Notes about how the MMW CUAHSI WDC currently operates:

  • GetServicesInBox2 is run in the background and cached for 1 week (note: we originally requested a 1-day cache)
  • GetSeriesCatalogForBox2 is used as the main search API, using the cached service results. We discussed using GetSeriesCatalogForBox3 but found no compelling advantages relative to the disadvantage of getting a much larger payload back.
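
The one-week caching of the services list can be pictured with a small time-to-live cache. This is a generic sketch, not the MMW implementation; the class name, the `fetch` callable, and the in-memory dict are all assumptions for illustration.

```python
import time

ONE_WEEK = 7 * 24 * 3600  # seconds; mirrors the 1-week cache noted above


class TTLCache:
    """Keep one value per key until it is older than ttl seconds."""

    def __init__(self, ttl=ONE_WEEK, clock=time.time):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() when it is stale."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]
        value = fetch()  # e.g. a hypothetical wrapper around GetServicesInBox2
        self._store[key] = (now, value)
        return value
```

Switching to the originally requested 1-day cache would only mean passing `ttl=24 * 3600`.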

Background discussions from 2017, during development:

@emiliom
Member Author

emiliom commented Mar 23, 2018

It'd be really nice if there was a catalog API operation that excluded grid services. OR if one of the existing operations had a parameter that allowed for the exclusion of grid services.
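
Until the catalog offers such an option, grid services could be dropped on the client side after the response arrives. A minimal sketch, assuming each series record is a dict with a `ServCode` key and that the caller already knows which service codes are grids; both the key name and the example codes are assumptions, not confirmed catalog details.

```python
def exclude_grid_series(records, grid_service_codes):
    """Return only the records whose service code is not a known grid service.

    records: iterable of dicts; the "ServCode" key is an assumed field name.
    grid_service_codes: iterable of service codes treated as grid services.
    """
    grid = {code.lower() for code in grid_service_codes}
    return [r for r in records if r.get("ServCode", "").lower() not in grid]


# Hypothetical usage with made-up service codes
sample = [
    {"ServCode": "GRID_A", "Sitename": "cell 1"},
    {"ServCode": "STATION_B", "Sitename": "gauge 7"},
]
non_grid = exclude_grid_series(sample, ["grid_a"])
```

This does not reduce the payload or the response time, which is why a server-side option would still be preferable.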

@emiliom
Member Author

emiliom commented Mar 25, 2018

Here are initial results from an assessment today using a Jupyter notebook I'll post later. I'll post more details later, too.

Each result is for a search based on a 1° x 1° square box ("square" in lat-lon coordinates) centered at the center point listed. Search requests were issued with suds-jurko. The last 3 columns show response times (including suds processing time) for 3 APIs:

  • GSCFB2 = GetSeriesCatalogForBox2 (currently used in the MMW portal)
  • GSCFB3 = GetSeriesCatalogForBox3
  • GSMCD = GetSeriesMetadataCountOrData (the newer API we're investigating)

| Location | Lat, lon center | AOI (km²) | Series count | Non-grid series count | GSCFB2 | GSCFB3 | GSMCD |
|---|---|---|---|---|---|---|---|
| Texas, south of Austin | 30.0, -97.5 | 10,707 | 5,288 | 4,488 | 20.5 s | 53.0 s | 36.9 s |
| Just N of the Schuylkill river near Philly | 40.1, -75.5 | 9,457 | 23,001 | 22,205 | 86.0 s | 181.0 s | 178.0 s |
| 1° N of the above PA/DRB point | 41.1, -75.5 | 9,317 | 16,744 | 15,944 | 60.0 s | 110.0 s | 128.0 s |
| Central Iowa | 42.0, -93.0 | 9,188 | 1,618 | 818 | 6.77 s | 12.4 s | 11.2 s |
| Halfway between Olympia, WA and Portland, OR | 46.5, -123.0 | 8,511 | 9,226 | 8,426 | 44.7 s | 73.0 s | 69.0 s |
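
For reference, the search boxes and their approximate areas can be derived from the center points. The sketch below assumes a spherical Earth with a 6371 km mean radius; how the AOI areas in the table were actually computed is not stated, so treat this as an approximation that happens to land close to the listed values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation (assumption)


def box_around(lat, lon, size_deg=1.0):
    """Return (xmin, ymin, xmax, ymax) for a lat-lon box centered on a point."""
    half = size_deg / 2.0
    return (lon - half, lat - half, lon + half, lat + half)


def box_area_km2(xmin, ymin, xmax, ymax):
    """Area of a lat-lon box on a sphere: R^2 * dlon * (sin(ymax) - sin(ymin))."""
    dlon = math.radians(xmax - xmin)
    band = math.sin(math.radians(ymax)) - math.sin(math.radians(ymin))
    return EARTH_RADIUS_KM ** 2 * dlon * band


texas_box = box_around(30.0, -97.5)
texas_area = box_area_km2(*texas_box)  # close to the 10,707 km² listed for Texas
```

The area shrinks with latitude because a degree of longitude gets narrower toward the poles, which matches the trend in the AOI column.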

@emiliom
Member Author

emiliom commented Apr 2, 2018

Just realized that the HIS APIs (or at least GetSeriesMetadataCountOrData) also accept GET and POST requests, not just SOAP. I don't know if that makes any difference in performance, though.
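
A GET request could be assembled with the standard library alone. The sketch below assumes the usual ASMX convention of appending the operation name to the service URL and of sending every parameter (empty when unused); the parameter names are the ones listed on the service page, while the `"true"`/`"false"` flag encoding and the default values are assumptions that have not been checked against the live service.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICE_URL = "http://hiscentral.cuahsi.org/webservices/hiscentral.asmx"
OPERATION = "GetSeriesMetadataCountOrData"
FILTER_NAMES = (
    "sampleMedium", "dataType", "valueType", "generalCategory",
    "conceptKeyword", "networkIDs", "beginDate", "endDate",
)


def build_get_url(xmin, ymin, xmax, ymax, get_data=False, **filters):
    """Build a GET URL for a bounding-box search (flag encoding is assumed)."""
    unknown = set(filters) - set(FILTER_NAMES)
    if unknown:
        raise ValueError("unknown filters: %s" % sorted(unknown))
    params = {
        "getData": "true" if get_data else "false",
        "getFacetOnCV": "false",
        "xmin": xmin, "xmax": xmax, "ymin": ymin, "ymax": ymax,
    }
    # Unused filters are sent as empty strings rather than omitted.
    for name in FILTER_NAMES:
        params[name] = filters.get(name, "")
    return "%s/%s?%s" % (SERVICE_URL, OPERATION, urlencode(params))


def fetch(url, timeout=300):
    """Fetch the raw XML response; not called here to avoid network access."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


url = build_get_url(-98.0, 29.5, -97.0, 30.5, conceptKeyword="precipitation")
```

Skipping suds this way would leave the XML parsing to the caller, so any speed-up would need to be measured rather than assumed.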

@emiliom
Member Author

emiliom commented Apr 2, 2018

The Jupyter notebook I used for this assessment, CUAHSI_HISCentral_AOI_service_tests.ipynb, can be accessed here. See the descriptions at the top.

This notebook was run once for each AOI listed in the table above. The specific results shown in the notebook snapshot (for the "1° N of the above PA/DRB point" AOI) differ from the ones listed in the table, because the data are dynamic and factors such as CUAHSI server loads and network latency are not constant. The results in the notebook were run today, Monday April 2 at 3:40pm PT, while results in the table above were run on Saturday March 24 (weekend server loads are probably lighter).
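
Because single runs vary with server load and latency, repeating each request and summarizing the timings may give steadier numbers. The helper below is a generic sketch, not code from the notebook; `func` stands for whatever callable issues the search request.

```python
import statistics
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def median_time(func, repeats=3, *args, **kwargs):
    """Run func several times and return the median elapsed time in seconds."""
    elapsed = [timed(func, *args, **kwargs)[1] for _ in range(repeats)]
    return statistics.median(elapsed)
```

Timing the call from the client side like this includes any client-side parsing, consistent with the response times reported above.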

@emiliom
Member Author

emiliom commented Apr 3, 2018

Extra notes I jotted down while composing the MMW issue I just created. Too much detail to include in that issue, but worth capturing here for easy reference.

  1. Adjust BigCZ max area from 1500 -> 8000 (WikiWatershed/model-my-watershed#2409): tests with 8000 km²; reported problems with suds
    • "I'll take a deeper look soon, but the main reason for going with 1500 for the limit was WDC searches timing out. It may be that we need to increase the timeout for those somewhere, or constrain the results otherwise (like limiting to the last five years for example) to make it finish in time, and pair that with this to be viable. I'll know more once I've taken a look."
      • My comment: The timeout was not on the WDC end. The "timeout" issue was the WDC response time exceeding a limit imposed by our application
    • "I tested this, but WDC results choke on even a 2000 sq km area of interest."
      • My comment: The choking is internal to the application handling of WDC responses
      • My comment: The comment discussed specific code within the application that was causing this problem, and potential solutions
  2. BiG-CZ: Increase Area of Interest Size (WikiWatershed/model-my-watershed#2418): "Increases size of area of interest to 8000. Limits BiG-CZ searches to the last five years by default whenever area of interest is bigger than 1500 km²."
    • "The commit 76e0eca alleviates a CPU and RAM utilization issue where a large number of CUAHSI results would fail to serialize to JSON in order to cache. Now we don't cache the CUAHSI results. However, suds itself chokes at a certain point (around 800 or so results). Thus the time limit."
      • My comment: I did not encounter any "suds" problems, even with up to 23K records, on a laptop that's not top-of-the-line hardware
      • My comment: Was he using the old, deprecated (abandoned) "suds"? That package is known to have problems. Use its active fork, suds-jurko
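
The rule quoted from the second pull request (8000 km² maximum, last five years by default above 1500 km²) can be expressed compactly. This is a sketch of the described behavior only; the function name and its return convention are assumptions and do not come from the MMW code base.

```python
from datetime import date

SMALL_AOI_KM2 = 1500  # above this, searches default to the last five years
MAX_AOI_KM2 = 8000    # largest area of interest accepted


def default_date_range(area_km2, today=None):
    """Return (begin, end) dates for large AOIs, or None for small ones."""
    if area_km2 > MAX_AOI_KM2:
        raise ValueError("area of interest exceeds %d km2" % MAX_AOI_KM2)
    if area_km2 <= SMALL_AOI_KM2:
        return None  # no date restriction
    today = today or date.today()
    try:
        begin = today.replace(year=today.year - 5)
    except ValueError:
        # Feb 29 with no matching day five years earlier
        begin = today.replace(year=today.year - 5, day=28)
    return (begin, today)
```

The returned pair would map naturally onto the `beginDate` and `endDate` request parameters.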

@aufdenkampe
Member

@emiliom, thanks for all your effort at testing, documenting, and finding likely paths to solve the WDC site search performance issue.
