
pagination has inconsistent results with both missing data and duplicated data #920

Closed
ElectricNroff opened this issue Nov 15, 2022 · 1 comment · Fixed by #946

Comments

@ElectricNroff
Contributor

Summary: The production CVE Services endpoints that use pagination, such as GET /cve, produce substantially incorrect results for many realistic API calls. The root cause of the problem is probably not yet understood. The problem has major consequences for multiple Secretariat use cases, and also may disrupt the ability of large CNAs to retrieve a list of their CVE IDs via the GET /cve-id endpoint.

Note that pagination anomalies can also be encountered by people who don't understand the time values for CVE Record pagination. That is a different issue; the issue being reported here is distinct and much worse. The time_modified.lt and time_modified.gt parameters for the GET /cve endpoint are intended to find CVE Records matching values in these data fields:

time: {
    created: Date,
    modified: Date
}

These aren't necessarily the same as fields such as datePublished and dateUpdated within:

cve: Object

Accordingly, it is typically only useful to select date/time values from after the CVE JSON 5.0 schema started being used in production. If a CVE Record was created by mongoimport and not touched after that, it does not have a useful time.modified value.

Because of this, one might expect that

time_modified.gt=2022-10-01T00:00:00.000Z

and

time_modified.lt=2022-12-01T00:00:00.000Z

should find largely the same set of CVE Records (i.e., all records from the Soft Deploy period). In other words, this "gt" series of GET requests:

https://cveawg.mitre.org/api/cve?time_modified.gt=2022-10-01T00:00:00.000Z
https://cveawg.mitre.org/api/cve?page=2&time_modified.gt=2022-10-01T00:00:00.000Z
https://cveawg.mitre.org/api/cve?page=3&time_modified.gt=2022-10-01T00:00:00.000Z
etc.

should collect the same set of CVE Records as this "lt" series of GET requests:

https://cveawg.mitre.org/api/cve?time_modified.lt=2022-12-01T00:00:00.000Z
https://cveawg.mitre.org/api/cve?page=2&time_modified.lt=2022-12-01T00:00:00.000Z
https://cveawg.mitre.org/api/cve?page=3&time_modified.lt=2022-12-01T00:00:00.000Z
etc.
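For reference, the traversal logic amounts to a small pager. This is only a sketch: the page-fetching function is injected (so it could wrap any HTTP client), and the record shape assumed here (cveMetadata.cveId) is an assumption about the response JSON, not confirmed CVE Services output.

```python
# Sketch of a pager that walks page=1, page=2, ... of a paginated endpoint
# and collects every CVE ID returned. The page-fetching function is injected,
# so the same loop can run against the live API or against canned test data.
# NOTE: the record shape ("cveMetadata" -> "cveId") is an assumption.

def collect_cve_ids(fetch_page):
    """Walk pages until an empty page; return every ID seen, duplicates included."""
    ids = []
    page = 1
    while True:
        records = fetch_page(page)
        if not records:
            break
        ids.extend(r["cveMetadata"]["cveId"] for r in records)
        page += 1
    return ids
```

With correct pagination, len(ids) would equal len(set(ids)); the behavior reported below violates that invariant.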

At first glance, the results look approximately correct: the last valid page in each series is page 8, and both find exactly 3764 CVE IDs. The first problem is that neither the gt series nor the lt series finds 3764 unique CVE IDs, and the number of unique CVE IDs varies on each attempt. For example, one time the lt series had 3539 unique CVE Records, which is 225 fewer than 3764. At that time, the page=7 response had 192 CVE Records that were also part of the page=8 response, the page=2 response had 33 CVE Records that were also part of the page=3 response, and so on.

The second problem is that the two sets of 3764 CVE IDs aren't the same. Specifically, when the lt series had 3539 unique CVE IDs and the gt series had 3674 unique CVE IDs (i.e., 90 fewer than 3764), there were 63 CVE Records found only by the lt requests and 198 found only by the gt requests. There does not seem to be a clear pattern. For example, one CVE Record was published on 2022-11-01 and then updated on 2022-11-10: it was found only by the gt series. Another was published on 2022-11-08 and then updated on 2022-11-09: it was found only by the lt series.
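Comparing the two runs is plain set arithmetic over the collected ID lists, independent of how the lists were gathered:

```python
# Compare the IDs collected by the "gt" traversal with those collected by
# the "lt" traversal. Both series cover the same window, so ideally the
# symmetric difference would be empty; the observations above show it is not.

def compare_runs(gt_ids, lt_ids):
    gt, lt = set(gt_ids), set(lt_ids)
    return {
        "only_gt": sorted(gt - lt),   # found by gt requests only
        "only_lt": sorted(lt - gt),   # found by lt requests only
        "common": len(gt & lt),
    }
```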

The extent of the problem varies across test runs. For example, it is possible to have 3764 CVE Records but less than 2000 unique ones.

This means that there is apparently no way to use https://cveawg.mitre.org/api/cve?time_modified.gt (accompanied by later page=2, page=3, ... requests) that will guarantee that all CVE Records after a certain date are captured.

Any set of found CVE Records may include a few with dateUpdated values before Soft Deploy. This occurs because clients with Secretariat privileges can use PUT /cve/{id} and modify data without bothering to supply correct dateUpdated values (e.g., the manual fix to CVE-2022-32170 during deployment because it didn't comply with the JSON 5 schema). Nobody is doing that routinely.

All of this data was collected at a time of low usage of production CVE Services, and it seems extremely unlikely that someone else created or modified a CVE Record at the moment that the traversal through the eight pages was occurring.

It appears that some or all of the problem also affects GET /cve-id pagination. (GET /org was not tested.) When testing GET /cve-id pagination, it may be necessary to add parameters to avoid a 500 Internal Server Error from CVE Services, such as:

page=2&time_modified.lt=2022-12-01T00:00:00.000Z&cve_id_year=2022&state=PUBLISHED

For example, in one case, the same CVE ID was part of the response for seven different page= values. It is possible that, on average, effects on GET /cve-id pagination are less dramatic than effects on GET /cve pagination but this has not been confirmed. Of course, GET /cve-id pagination is very important in the sense that it is available to CNAs, whereas GET /cve pagination is only for the Secretariat (but the effects on Secretariat operations are substantial).
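The "same CVE ID on seven pages" symptom can be quantified by counting, for each CVE ID, how many page responses contained it; a small sketch over already-collected page data (not API code):

```python
from collections import Counter

# pages_seen maps page number -> list of CVE IDs returned for that page.
# With correct pagination, every ID should appear in exactly one page response.

def repeated_across_pages(pages_seen):
    counts = Counter()
    for ids in pages_seen.values():
        for cve_id in set(ids):        # dedupe within a single page first
            counts[cve_id] += 1
    return {cve_id: n for cve_id, n in counts.items() if n > 1}
```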

It is possible that GET requests (that cause transactions on the same database) are somehow responsible for the incorrect pagination behavior. There is a substantial volume of GET requests to the production server 24x7x365. Because of this, it may be difficult or impossible to reproduce the problem in a non-production environment (e.g., test or prod-staging) without generating similar fake traffic.

In any case, the current software for implementing pagination (as a way to split up large data requests) clearly does not work correctly, and no part of the CVE Program should be relying on it. It is possible that the problem is in the package mongoose-aggregate-paginate-v2 itself, in how mongoose-aggregate-paginate-v2 is used by CVE Services, in how these interact with DocumentDB (rather than MongoDB), or in another area (e.g., database corruption).
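One plausible mechanism (an assumption here, though consistent with the later fix commits that change the sort field to cveId) is skip/limit pagination over a sort order that is not stable between requests: if many documents tie on the sort key, the database may return them in a different order on each query, so consecutive page slices overlap and skip. A toy simulation, with a deterministic rotation standing in for the per-request reordering:

```python
# Simulate skip/limit pagination when the effective document order changes
# between requests. Each "request" sees the collection rotated by one more
# position -- a stand-in for a database reordering tied sort values.

def paginate_with_shifting_order(docs, limit):
    collected = []
    for page in range(len(docs) // limit):
        order = docs[page:] + docs[:page]          # order drifted since last request
        collected.extend(order[page * limit:(page + 1) * limit])
    return collected

docs = [f"CVE-2022-{n:04d}" for n in range(100)]
ids = paginate_with_shifting_order(docs, limit=10)
print(len(ids), len(set(ids)))   # 100 rows collected, but only 91 unique
```

Sorting on a unique key such as cveId keeps the order identical across requests and removes this failure mode, which matches the direction of the linked fix.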

Temporary workarounds might include:

  • Change src/constants/index.js to set PAGINATOR_OPTIONS.limit to a value much higher than 500. This could potentially help with some Secretariat use cases, but could also interfere with some CNA use cases (e.g., the CNA calls GET /cve-id from a service/process that is not currently provisioned with enough memory for very large API responses, or very large API responses would lead to network timeouts).
  • If someone needs a large amount of data, choose a solution that relies on mongoexport rather than the API.
  • Any use case that needs a large amount of data should ensure that it runs continuously, collecting small chunks of that data. It should not be shut off and then restarted with a large data gap.
  • Try to make production CVE Services quiescent more often by redirecting some or all GET requests to a different server that has a (perhaps slightly delayed) copy of the data.
  • Check whether the problem can be eliminated by using the latest version of MongoDB, not DocumentDB. This envisions a (possibly major) architectural change in which the CVE Services Fargate tasks would rely on an EFS volume for persistent storage for MongoDB.
@github-actions github-actions bot added this to Needs Triage in Issue Triage Nov 15, 2022
@jdaigneau5 jdaigneau5 moved this from Needs Triage to High Priority in Issue Triage Nov 15, 2022
@jdaigneau5 jdaigneau5 removed this from High Priority in Issue Triage Nov 15, 2022
@mprpic
Contributor

mprpic commented Nov 18, 2022

+1 for fixing this. Even trying to compile a list of all reserved CVE IDs that we published over all years produces wildly inconsistent lists:

$ cve list --state published | tail -n +2 | cut -d' ' -f1 > ~/temp/redhat_cves.txt
$ sort -u ~/temp/redhat_cves.txt | wc -l
6113
$ wc -l ~/temp/redhat_cves.txt 
9612

slubar added a commit that referenced this issue Nov 22, 2022
jdaigneau5 added a commit that referenced this issue Nov 22, 2022
#920 turn on debug mode for mongoose
@jdaigneau5 jdaigneau5 moved this from To Do to In Progress in Sprint 19 November 14 - November 25 Nov 22, 2022
@slubar slubar removed this from In Progress in Sprint 19 November 14 - November 25 Nov 28, 2022
@slubar slubar moved this from To Do to In Progress in Sprint 20 November 28 - December 9 Nov 30, 2022
slubar added a commit that referenced this issue Dec 6, 2022
#920 chore: remove code no longer necessary to correctly sort
jdaigneau5 added a commit that referenced this issue Dec 6, 2022
#920 chore: change sorting to use cveId instead of _id for /cve endpoint
jdaigneau5 added a commit that referenced this issue Dec 6, 2022
#920 chore: correct sort field for cve endpoint
jdaigneau5 added a commit that referenced this issue Dec 7, 2022
#920 chore: change order of aggregate query for better performance on /cve/
jdaigneau5 added a commit that referenced this issue Dec 9, 2022
#920 chore: remove debugging settings for mongoose
@slubar slubar linked a pull request Dec 9, 2022 that will close this issue
@jdaigneau5 jdaigneau5 moved this from In Progress to In Review in Sprint 20 November 28 - December 9 Dec 12, 2022
@jdaigneau5 jdaigneau5 moved this from In Review to Done in Sprint 20 November 28 - December 9 Dec 12, 2022
@slubar slubar closed this as completed Dec 12, 2022