Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve speed of couchbase station queries #755

Closed
mollybsmith-noaa opened this issue Aug 25, 2022 · 13 comments
Closed

Improve speed of couchbase station queries #755

mollybsmith-noaa opened this issue Aug 25, 2022 · 13 comments
Assignees
Labels
epic A piece of functionality directly related to requirements similar to an epic user story MATS An issue affecting the MATS project server Improvements or additions to the server code

Comments

@mollybsmith-noaa
Copy link
Collaborator

Right now, the station queries in our cb-ceiling app are working well enough to validate the app and serve as a proof of concept, but they should be made faster before the app is released for production. Right now they are either on par with mysql or slightly slower, which won't be adequate for heavy process-oriented verification use.

@mollybsmith-noaa mollybsmith-noaa added server Improvements or additions to the server code epic A piece of functionality directly related to requirements similar to an epic user story MATS An issue affecting the MATS project labels Aug 25, 2022
@bonnystrong
Copy link

App currently testable, but must be done before this is in production. Defer until after scorecard in a place to move away from it for a while.

@JeffHamiltonNOAA JeffHamiltonNOAA self-assigned this Oct 5, 2022
@JeffHamiltonNOAA
Copy link
Collaborator

Adding myself and making it a high priority since this needs to be completed before we truly dive into the 3-D component

@gopa-noaa
Copy link
Contributor

I am starting with trying to understand the TimeSeries query. If would help greatly if one of you could explain in "plain english' or even better in pseudo-code what this query is trying to do.

@gopa-noaa
Copy link
Contributor

Is it OK to create some indexes on adb-cb1 ? . Just want to play with CouchBase TimeSeries query variations, but don't want to step on anyone's toes ...

@randytpierce
Copy link
Collaborator

Certainly, but it is easy for them to get out of hand, so please keep track of any changes or additions. I maintain the indexes in source code in the VxIngest repo here... adb-cb1 indexes

The standalone server adb-cb1 is the 0-replicas script but for indexes that we are controlling I always add them to the 2-replicas script as well.

For experimenting just note them and remove them if they don't help. Please keep in mind that we are also ingesting to this machine and the ingest scripts do rely on the indexing as well.

@gopa-noaa
Copy link
Contributor

Randy, I am seeing the mdatatest bucket, thanks. I will experiment with that. Planning to focus on time-series query for now. Let me know if there is a way to create another bucket where data is a map (on name key) instead of an array. Not sure if this will help query speed, but worth exploring as we discussed.

@randytpierce
Copy link
Collaborator

We now have a three prong approach to this problem. (Jeff and Gopa please correct any mis-statements about the approaches)

  1. I (randy) will create a new bucket named 'mdatatest' that has a subset of our data in it with the mdata.data portion of each data object a map instead of an array. We will see if we can make the queries operate faster on that data structure. If that works we can change the ingest and modify all the data either in place or in a new bucket or in some other way that we figure out later.
  2. Jeff is going to explore modifying our existing query by possibly breaking it up into subqueries.
  3. Gopa is going to try to explore better indexing methods.
    It makes sense to me to make these into new issues so the first one will be VxIngest issue#

@randytpierce
Copy link
Collaborator

I did create the mdatatest bucket but I haven't yet figured out how to transform the data section into a map.
I was trying this (for a single document) but it is still nested. This is where I am at trying to get a result for a single document that is formed in a map instead of an array.

SELECT m
FROM 
mdata
UNNEST mdata.data as mapdata
LET m = {mapdata.name:mapdata}
WHERE mdata.type="DD"
    AND mdata.docType='obs'
    AND mdata.version="V01"
    AND mdata.subset="METAR"
    AND mdata.fcstValidEpoch = 1636135200

@JeffHamiltonNOAA
Copy link
Collaborator

Discussion on this topic is available at https://github.com/NOAA-GSL/VxIngest/discussions/165

@gopa-noaa
Copy link
Contributor

I have the following query returning in 4.6s, looks like returning similar results to original query, but definitely needs some due diligence to be sure. note use of (non-normalized index field idx0)

@gopa-noaa
Copy link
Contributor

SELECT m.mfve AS avtime,
ARRAY_SUM(stats [].hit) AS hits,
ARRAY_SUM(stats [
].miss) AS misses,
ARRAY_SUM(stats [].false_alarm) AS fa,
ARRAY_SUM(stats [
].correct_negative) AS cn,
ARRAY_SUM(stats [*].total) AS N0
FROM (
SELECT trimData AS odata,
ofve
FROM mdatatest0._default.obs AS obs
LET ofve = obs.fcstValidEpoch,
trimData = ARRAY d FOR d IN ( [obs.stations.KEWR, obs.stations.KJFK, obs.stations.KJRB, obs.stations.KLDJ, obs.stations.KLGA, obs.stations.KNYC, obs.stations.KTEB] ) END
WHERE obs.idx0 = "DD:METAR:V01:undefined"
AND obs.fcstValidEpoch BETWEEN 1662249600 AND 1664841600 ) o,
(
SELECT trimData AS m0data,
mfve
FROM mdatatest0._default.model AS m0
LET mfve = m0.fcstValidEpoch,
trimData = ARRAY d FOR d IN ( [m0.stations.KEWR, m0.stations.KJFK, m0.stations.KJRB, m0.stations.KLDJ, m0.stations.KLGA, m0.stations.KNYC, m0.stations.KTEB] ) END
WHERE m0.idx0 = "DD:METAR:V01:HRRR_OPS"
AND m0.fcstLen = 6
AND m0.fcstValidEpoch BETWEEN 1662249600 AND 1664841600 ) m
LET stats = ARRAY( FIRST { 'hit' :CASE WHEN mv.Ceiling < 3000.0
AND ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'miss' :CASE WHEN mv.Ceiling < 3000.0
AND NOT ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'false_alarm' :CASE WHEN NOT mv.Ceiling < 3000.0
AND ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'correct_negative' :CASE WHEN NOT mv.Ceiling < 3000.0
AND NOT ov.Ceiling < 3000.0 THEN 1 ELSE 0 END,
'total' :CASE WHEN mv.Ceiling IS NOT MISSING
AND ov.Ceiling IS NOT MISSING THEN 1 ELSE 0 END } FOR ov IN o.odata WHEN ov.name = mv.name END ) FOR mv IN m.m0data END
WHERE m.mfve = o.ofve

@gopa-noaa
Copy link
Contributor

mdatatest0 uses Couchbase 7.x Scope and Collection features. I am using default scope with 2 collections, obs and model

@mollybsmith-noaa mollybsmith-noaa removed their assignment Oct 26, 2022
@JeffHamiltonNOAA
Copy link
Collaborator

All plot type queries Molly provided have been converted over to the latest optimized Couchbase queries. I have compared the results against previous CB queries used in the MATS code, and they match. I have the list of queries (SQL and CB) for the various plot types in the following document:

https://docs.google.com/document/d/1ZZ6K9Yjf0wZ-XaK0McJlM08ew2tqnaIdNwEJ9dwHPak/edit?usp=sharing

This issue can be closed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
epic A piece of functionality directly related to requirements similar to an epic user story MATS An issue affecting the MATS project server Improvements or additions to the server code
Projects
None yet
Development

No branches or pull requests

5 participants