High Load on (PostgreSQL) DB after upgrade to 4.2.1 #1026

Closed
guruguruguru opened this issue Apr 29, 2021 · 33 comments
Labels
defect (Something isn't working), p1 (Critical bugs that prevent DT from being used, or features that must be implemented ASAP), pending release

Comments

@guruguruguru

Hi all,

Since we upgraded to 4.2.1, we have been seeing very high load on our DB server.
The load is caused by the following statement:

SELECT $1 AS "DN_TYPE","A0"."AUTHOR","A0"."BLAKE2B_256","A0"."BLAKE2B_384","A0"."BLAKE2B_512","A0"."BLAKE3","A0"."CLASSIFIER","A0"."COPYRIGHT","A0"."CPE","A0"."DESCRIPTION","A0"."DIRECT_DEPENDENCIES","A0"."EXTENSION","A0"."FILENAME","A0"."GROUP","A0"."ID" AS "NUCORDER0","A0"."INTERNAL","A0"."LAST_RISKSCORE","A0"."LICENSE","A0"."MD5","A0"."NAME","A0"."TEXT","B0"."ACTIVE","B0"."AUTHOR","B0"."CLASSIFIER","B0"."CPE","B0"."DESCRIPTION","B0"."DIRECT_DEPENDENCIES","B0"."GROUP","B0"."ID","B0"."LAST_BOM_IMPORTED","B0"."LAST_BOM_IMPORTED_FORMAT","B0"."LAST_RISKSCORE","B0"."NAME","B0"."PUBLISHER","B0"."PURL","B0"."SWIDTAGID","B0"."UUID","B0"."VERSION","A0"."PUBLISHER","A0"."PURL","A0"."PURLCOORDINATES","C0"."FSFLIBRE","C0"."ID","C0"."LICENSEID","C0"."NAME","C0"."ISOSIAPPROVED","C0"."UUID","A0"."SHA1","A0"."SHA_256","A0"."SHA_384","A0"."SHA3_256","A0"."SHA3_384","A0"."SHA3_512","A0"."SHA_512","A0"."SWIDTAGID","A0"."UUID","A0"."VERSION" FROM "COMPONENT" "A0" INNER JOIN "PROJECT" "B0" ON "A0"."PROJECT_ID" = "B0"."ID" LEFT OUTER JOIN "LICENSE" "C0" ON "A0"."LICENSE_ID" = "C0"."ID" ORDER BY "NUCORDER0" OFFSET $2 ROWS FETCH NEXT $3 ROWS ONLY

This statement is executed ~14,000 times per hour.

The execution plan looks alright, though I am not sure what values are used for the parameters:

Limit  (cost=87.06..173.14 rows=100 width=9591)
  ->  Nested Loop Left Join  (cost=0.98..279289.71 rows=324443 width=9591)
        ->  Nested Loop  (cost=0.70..184548.91 rows=324443 width=9507)
              ->  Index Scan using "COMPONENT_PK" on "COMPONENT" "A0"  (cost=0.42..84801.68 rows=324443 width=6040)
              ->  Index Scan using "PROJECT_PK" on "PROJECT" "B0"  (cost=0.28..0.31 rows=1 width=3475)
                    Index Cond: ("ID" = "A0"."PROJECT_ID")
        ->  Index Scan using "LICENSE_PK" on "LICENSE" "C0"  (cost=0.27..0.29 rows=1 width=88)
              Index Cond: ("ID" = "A0"."LICENSE_ID")
(8 rows)

Do you have any idea? I was not sure whether this qualifies as an issue or a question; if this is not an issue, please close it.

Regards,
Dennis

@guruguruguru added the question (Further information is requested) label Apr 29, 2021
@stevespringett
Member

Was this statement not issued or not an issue with 4.0.0, 4.1.0, or 4.2.0?

@guruguruguru
Author

guruguruguru commented Apr 30, 2021

Hi @stevespringett ,

Sorry, I forgot to mention: we upgraded from 3.8.0; before that there was no issue.

If we restart the app container, the problem vanishes until about 0:20 in the morning, and then it starts again.
Maybe it is just a faulty configuration? A job that cannot finish?

Thanks,
Dennis

@stevespringett
Member

I guess it's possible that many nightly builds are overloading DT. Another possibility is an integration (e.g. a vulnerability aggregation platform or BI tool) that is trying to extract a ton of data from DT for reporting on a scheduled basis.

Vulnerability analysis of all components in DT happens every 24 hours, not at a specific time. It's based on the last time DT was started. So if it's restarted at 9am, then the following day at 9am a vulnerability analysis would be performed again. So I don't think it's that.

@guruguruguru
Author

@stevespringett Do you have any hint on how to debug this? After a restart of the DT container, the load stays very low until midnight. Then it goes back to 30 and stays there until the next restart.

I (as the DBA) could identify the above statement, but I do not know much about the application.

Thanks a lot for helping!

@lusor

lusor commented May 5, 2021

We found the cause of the issue: Deptrack has no stop condition (or probably a wrong one) for fetching the rows in batches. It tries to fetch rows starting at offset 25,010,000 while the COMPONENT table only has around 300k rows. The query then returns 0 results and Deptrack tries to fetch the next batch.

@stevespringett
Member

DT relies on DataNucleus for its persistence layer. The DT REST API allows for pagination, and any API call can fetch whatever range it wants. So it's possible that an integration (e.g. a BI tool) is causing the issue.

The server does some internal pagination for some calls as well, so it's possible this logic is wrong. But I'll need to know where to look. WRT the application logs, where does the statement get executed? Is it executing when performing a vulnerability analysis, when updating metrics, or during some other task?
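For anyone trying to rule out an external integration, here is a minimal client-side sketch of how such a tool would page through the REST API. The host, API key, and project UUID are placeholders, and pageNumber/pageSize are the pagination parameters the API accepts as far as I know; treat the details as assumptions and check them against your own instance.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ComponentPager {
    public static void main(String[] args) throws Exception {
        // Placeholders - point these at your own instance.
        final String baseUrl = "http://dtrack.example.com:8081";
        final String apiKey = "REPLACE_ME";
        final String projectUuid = "00000000-0000-0000-0000-000000000000";
        final HttpClient client = HttpClient.newHttpClient();

        int page = 1;
        while (true) {
            final HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(baseUrl + "/api/v1/component/project/" + projectUuid
                            + "?pageNumber=" + page + "&pageSize=100"))
                    .header("X-Api-Key", apiKey)
                    .GET()
                    .build();
            final HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // An empty JSON array ("[]") means we have paged past the last row - stop here.
            // A client that never checks this and keeps incrementing pageNumber would
            // generate exactly the kind of ever-growing OFFSET queries reported above.
            if (response.body() == null || response.body().trim().length() <= 2) {
                break;
            }
            System.out.println("Page " + page + ": " + response.body().length() + " bytes");
            page++;
        }
    }
}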

@kekkegenkai
Contributor

kekkegenkai commented May 5, 2021

@stevespringett

After looking into the logs of the api-server container, I've seen that a MetricsUpdateTask is running:

12:37:11.220 INFO [MetricsUpdateTask] Executing metrics update for project: 9c06ec5f-dff2-46c8-88eb-2a108fb9c787
13:01:20.610 INFO [MetricsUpdateTask] Completed metrics update for project: 9c06ec5f-dff2-46c8-88eb-2a108fb9c787

Also we are getting an error which we have never seen before:

ERROR [Cache] >> EC.preCommit L1Cache op IS NULL!

I hope these logs help you.
If not, let me know where I should search for more detailed logs, or which kind of logs you prefer.

@stevespringett
Member

Thanks @kekkegenkai. I'll take a look at the metrics logic and ensure there aren't any logic flaws around DB pagination.

Regarding ERROR [Cache] >> EC.preCommit L1Cache op IS NULL!: this has come up before.

See:
https://datanucleus.groups.io/g/main/topic/ec_precommit_l1cache_op_is/9346765?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,1,0,9346765

DT recently moved to Java 11 inside its containers and I have not seen this issue with Java 11. But it's still possible that Java 8 may be a better choice. Try increasing the heap to a large number (12 GB or more) and ensure the host machine has enough physical RAM to accommodate it.

@davidkarlsen
Contributor

@kriss-s @Otterian

@davidkarlsen
Contributor

davidkarlsen commented May 7, 2021

We experience this too. It hangs every day, and it seems to paginate endlessly on:

021-05-06 07:19:01.687 UTC,"postgres","postgres",344316,"10.200.8.27:44506",6093983d.540fc,64,"SELECT",2021-05-06 07:18:21 UTC,17/4143331,0,LOG,00000,"duration: 595.154 ms  execute <unnamed>: SELECT 'org.dependencytrack.model.Component' AS ""DN_TYPE"",""A0"".""AUTHOR"",""A0"".""BLAKE2B_256"",""A0"".""BLAKE2B_384"",""A0"".""BLAKE2B_512"",""A0"".""BLAKE3"",""A0"".""CLASSIFIER"",""A0"".""COPYRIGHT"",""A0"".""CPE"",""A0"".""DESCRIPTION"",""A0"".""DIRECT_DEPENDENCIES"",""A0"".""EXTENSION"",""A0"".""FILENAME"",""A0"".""GROUP"",""A0"".""ID"" AS ""NUCORDER0"",""A0"".""INTERNAL"",""A0"".""LAST_RISKSCORE"",""A0"".""LICENSE"",""A0"".""MD5"",""A0"".""NAME"",""A0"".""TEXT"",""B0"".""ACTIVE"",""B0"".""AUTHOR"",""B0"".""CLASSIFIER"",""B0"".""CPE"",""B0"".""DESCRIPTION"",""B0"".""DIRECT_DEPENDENCIES"",""B0"".""GROUP"",""B0"".""ID"",""B0"".""LAST_BOM_IMPORTED"",""B0"".""LAST_BOM_IMPORTED_FORMAT"",""B0"".""LAST_RISKSCORE"",""B0"".""NAME"",""B0"".""PUBLISHER"",""B0"".""PURL"",""B0"".""SWIDTAGID"",""B0"".""UUID"",""B0"".""VERSION"",""A0"".""PUBLISHER"",""A0"".""PURL"",""A0"".""PURLCOORDINATES"",""C0"".""FSFLIBRE"",""C0"".""ID"",""C0"".""LICENSEID"",""C0"".""NAME"",""C0"".""ISOSIAPPROVED"",""C0"".""UUID"",""A0"".""SHA1"",""A0"".""SHA_256"",""A0"".""SHA_384"",""A0"".""SHA3_256"",""A0"".""SHA3_384"",""A0"".""SHA3_512"",""A0"".""SHA_512"",""A0"".""SWIDTAGID"",""A0"".""UUID"",""A0"".""VERSION"" FROM ""COMPONENT"" ""A0"" INNER JOIN ""PROJECT"" ""B0"" ON ""A0"".""PROJECT_ID"" = ""B0"".""ID"" LEFT OUTER JOIN ""LICENSE"" ""C0"" ON ""A0"".""LICENSE_ID"" = ""C0"".""ID"" ORDER BY ""NUCORDER0"" OFFSET 6302900 ROWS FETCH NEXT 100 ROWS ONLY ",,,,,,,,,"PostgreSQL JDBC Driver","client backend"
select count(*) from "COMPONENT";

 count  
--------
 177796
(1 row)

@rylyade1

Hi @stevespringett

Could you find anything in the metrics logic that may be causing this issue?

@stevespringett
Member

Did anyone NOT experience this issue with 4.0 or 4.1? Is it only limited to 4.2? I haven't found anything yet, but there are a lot of places to look, and the statements provided don't give much of a clue.

@stevespringett
Member

I've checked all the places in DT where querying and pagination of components occur. This logic has not changed in over two years. So I'm wondering whether this is a JDBC driver issue or a DataNucleus issue.

Is anyone who is having this issue NOT using Postgres?

@guruguruguru
Author

I cannot think of a configuration in the JDBC driver that could cause this issue. Do you have anything in mind?

@stevespringett
Member

@guruguruguru I don't know of anything specific to the JDBC driver that would cause this issue. However, the JDBC driver and DataNucleus were both updated in 4.x, whereas the pagination code in DT hasn't changed in two years.

I'm setting up a dedicated environment for Postgres and will be doing some testing today with a 4.3 snapshot, to which I've added a bunch of debug logging. Perhaps this will be helpful, but right now I'm flying blind.

@guruguruguru
Author

I am happy to help/test if you need something!

@stevespringett
Member

There are three main places where all components in the database are queried for and paginated through. They are:

  • ComponentIndexer - Reindexes the internal Lucene search engine for all component objects
  • RepositoryMetaAnalyzerTask - Queries repos to determine the latest version of a component and when it was published
  • BaseComponentAnalyzerTask - Common to all the vulnerability analyzers (OSSIndex, NPM, VulnDB, and Internal) where all components are iterated on and analyzed for known vulnerabilities

In my test environment, I've modified TaskScheduler so that portfolio metrics, vulnerability analysis, and meta analysis are constantly being triggered. All three of these are working as designed. It's paginating correctly in all three cases.

@stevespringett
Member

One scenario that may be problematic: in 3.x there was a global component model, and components were typically additive. It would be rare for components to be deleted, as that would have to happen intentionally.

With 4.x, the global model doesn't exist. Components are tied to specific projects. If the component is no longer used by a project, or a project is deleted, the component is deleted.

If the count was determined prior to components being deleted, then that might be a problem. But this would not explain the issue @lusor is having where it was attempting to fetch row 25,010,000 on a table that only has 300K rows.

I'm at a loss and unable to replicate.
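To make the suspected failure mode concrete, here is a self-contained sketch (plain Java, not DT code, just the same count-then-paginate pattern) showing how a stale total turns into an endless paging loop once rows are deleted mid-run:

import java.util.ArrayList;
import java.util.List;

public class StaleCountLoop {
    public static void main(String[] args) {
        // 1,000 fake "components", paged 100 at a time, like the analyzer tasks do.
        final List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) table.add(i);

        final long total = table.size();   // the total is computed once, up front
        long count = 0;
        long offset = 0;
        int iterations = 0;

        while (count < total && iterations < 20) {   // safety cap for the demo only
            // Simulate a project deletion that removes 50 components mid-run.
            if (iterations == 3) table.subList(0, 50).clear();

            // Equivalent of "OFFSET offset ROWS FETCH NEXT 100 ROWS ONLY".
            final int from = (int) Math.min(offset, table.size());
            final int to = (int) Math.min(offset + 100, table.size());
            final List<Integer> page = table.subList(from, to);

            count += page.size();   // once we page past the end, this adds 0 forever
            offset += 100;          // but the offset keeps advancing
            iterations++;
            System.out.printf("iteration %d: offset=%d fetched=%d count=%d/%d%n",
                    iterations, offset, page.size(), count, total);
        }
        // Without the safety cap, count tops out at 950 and never reaches total (1000),
        // so the loop keeps issuing ever-larger OFFSET queries that return 0 rows.
    }
}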

@hostalp

hostalp commented May 17, 2021

This occurs in our DT installation as well. We're also running it against a PostgreSQL database, and we also upgraded from 3.x to 4.2, so I can't comment on the 4.0 or 4.1 behavior. In our case it doesn't happen every day like others mention, but after some two weeks of runtime. After today's occurrence I set up some stats collection on the DB side, so hopefully we will be able to gather some useful information next time.

@kekkegenkai
Contributor

kekkegenkai commented May 17, 2021

@hostalp

We found out that the problem (in our case) happens after bulk-loading BOMs into Dependency-Track.
After talking to others, the same is true after running the Jenkins plugin or scripts that use the API.
I don't know how often you load BOMs into DT, but if you don't do it every day, this might be the reason.
Maybe this information helps you find the right logs.

@mvanini

mvanini commented May 17, 2021

Same behavior for us, even with a fresh new installation of DT 4.2.2.

We have intense CI activity every night and weekend, and we push new BOMs via the API to DT (with a limit of 30 projects per hour). Projects (and consequently their BOMs) have different sizes: from a couple of components to many hundreds. After a few days we notice high CPU load on Postgres and degraded API responses, which sometimes result in timeouts for the caller (the CI pipeline itself).

After a restart of the API, everything comes back to normal.

@hostalp

hostalp commented Jun 3, 2021

I'm watching this occur right now, and on the DB side it's the same SELECT that others have already reported here (SELECT 'org.dependencytrack.model.Component' AS "DN_TYPE","A0"."AUTHOR","A0"."BLAKE2B_256",...)

Interestingly, the value of the rows column (total number of rows retrieved or affected by the statement) of the pg_stat_statements view for this SELECT always shows 0 in these cases.

At other times (when the issue doesn't occur and the statement executes with much lower performance impact, per_call_ms < 3 compared to approx. 522 while the issue is occurring) the value grows by 251 (in our case) per execution.
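For anyone who wants to collect the same evidence, here is a small JDBC sketch that pulls those counters out of pg_stat_statements. It assumes the extension is enabled, the PostgreSQL JDBC driver is on the classpath, and placeholder connection details; mean_exec_time is the column name on PostgreSQL 13+ (older versions call it mean_time).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PgStatCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details - point this at the DT database.
        final String url = "jdbc:postgresql://db.example.com:5432/dtrack";
        try (Connection conn = DriverManager.getConnection(url, "dtrack", "secret");
             Statement st = conn.createStatement();
             // calls, rows, and per-call time for the component pagination statement.
             ResultSet rs = st.executeQuery(
                     "SELECT calls, rows, round(mean_exec_time::numeric, 1) AS per_call_ms " +
                     "FROM pg_stat_statements " +
                     "WHERE query LIKE '%FROM \"COMPONENT\"%ORDER BY \"NUCORDER0\"%' " +
                     "ORDER BY calls DESC LIMIT 5")) {
            while (rs.next()) {
                // rows = 0 combined with a climbing call count is the signature of the runaway loop.
                System.out.printf("calls=%d rows=%d per_call_ms=%s%n",
                        rs.getLong("calls"), rs.getLong("rows"), rs.getString("per_call_ms"));
            }
        }
    }
}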

@stevespringett
Member

@hostalp when you were seeing the bad behavior, what did the DT logs say? Which task was running that was generating these queries?

@hostalp

hostalp commented Jun 3, 2021

The problem in our case is that the portfolio metrics update job takes a long time (currently almost 6 hours) to complete, so it's basically running all the time and can therefore muddy the results. It was definitely running at that time, as can be seen in the logs.

Going by the CPU usage history, the issue started yesterday (2nd June) between 07:50 and 08:00.
I attached the part of the log file covering this time period to show everything that started there.
dependency-track.1.zip

Based on what I put together from the available information I've got 2 comments:

  1. The portfolio metrics update job doesn't seem to be related to the issue. It started before the issue, completed later, then started and completed again, all while the issue was still occurring:
2021-06-02 04:51:06,943 [] INFO [org.dependencytrack.tasks.MetricsUpdateTask] Executing portfolio metrics update
2021-06-02 10:34:41,276 [] INFO [org.dependencytrack.tasks.MetricsUpdateTask] Completed portfolio metrics update
2021-06-02 10:51:06,943 [] INFO [org.dependencytrack.tasks.MetricsUpdateTask] Executing portfolio metrics update
2021-06-02 16:48:23,659 [] INFO [org.dependencytrack.tasks.MetricsUpdateTask] Completed portfolio metrics update
  2. I can see the start of the OssIndexAnalysisTask portfolio analysis at 07:50:56, but no completion of this run is visible in the log at all (and there was no restart which could explain that). Only at 13:50 is there another run, which completes some 13.5 minutes later. Perhaps this might be related:
2021-06-02 01:50:56,634 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Analyzing portfolio
2021-06-02 02:08:18,945 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Portfolio analysis complete
2021-06-02 07:50:56,597 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Analyzing portfolio
2021-06-02 13:50:56,639 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Analyzing portfolio
2021-06-02 14:04:26,801 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Portfolio analysis complete
2021-06-02 19:50:56,639 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Analyzing portfolio
2021-06-02 20:04:16,760 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Portfolio analysis complete
2021-06-03 01:50:56,633 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Analyzing portfolio
2021-06-03 02:04:14,766 [] INFO [org.dependencytrack.tasks.scanners.OssIndexAnalysisTask] Portfolio analysis complete

@stevespringett
Member

There's a few places in the code where iteration of all components takes place. They are:

  • ComponentIndexer
  • ProjectIndexer
  • RepositoryMetaAnalyzerTask
  • BaseComponentAnalyzerTask - used by:
    • InternalAnalysisTask
    • NpmAuditAnalysisTask
    • OssIndexAnalysisTask
    • VulnDbAnalysisTask

In 3.x, when a global component model existed, iterating through all the components made sense. However, even though I have not been able to reproduce the issue, I think iterating through projects, then through the components of each project, makes a bit more sense.

Previously, the code iterated through components 1000 at a time, globally. What I'm experimenting with is iterating through projects 100 at a time and, for each project, getting ALL components in one shot. This should increase performance a bit; for projects with over 1000 components there will be a slight impact on RAM, but it shouldn't be that much.

I'm working on a few other optimization techniques that may be helpful. Code will be checked in soon if anyone wants to test in a staging environment.
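A rough sketch of the project-first iteration described above. The QueryManager accessors (getProjects(), getAllComponents(Project)) are assumptions for illustration; substitute whatever the real class exposes.

try (QueryManager qm = new QueryManager(alpineRequest)) {
    final long totalProjects = qm.getCount(Project.class);
    long processed = 0;
    while (processed < totalProjects) {
        // Page through projects 100 at a time...
        final PaginatedResult projects = qm.getProjects();
        final int fetched = projects.getObjects().size();
        if (fetched == 0) {
            break;   // the inventory shrank underneath us - stop rather than loop forever
        }
        for (final Project project : projects.getList(Project.class)) {
            // ...and pull each project's full component list in one query.
            analyze(qm.getAllComponents(project));
        }
        processed += fetched;
        qm.advancePagination();
        doThrottleDelay();
    }
}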

@bugbouncer

bugbouncer commented Jun 7, 2021

Hi @stevespringett,
I think the issue is within the BaseComponentAnalyzerTask.
Your analysis regarding the deletion of components is correct: if a component is deleted while the analyzer task runs, you can end up in this indefinite-loop state.

The fix should be easy.

This is the loop in BaseComponentAnalyzerTask that iterates over all components:

try (QueryManager qm = new QueryManager(alpineRequest)) {
    final long total = qm.getCount(Component.class);          // 01
    long count = 0;
    while (count < total) {                                    // 02
        final PaginatedResult result = qm.getComponents();
        analyze(result.getList(Component.class));
        count += result.getObjects().size();                   // 03
        qm.advancePagination();
        doThrottleDelay();
    }
}
  • 01: you determine the overall number of components
  • 03: you increment the counter of processed components by the query result size

If a component is deleted while the analyzer is in progress, count stays below total: once the pagination runs past the end of the table, result.getObjects().size() (line 03) returns 0 on every further page, so count stops growing and the loop never terminates.

The solution could be an additional stop condition for the while loop (line 02):

try (QueryManager qm = new QueryManager(alpineRequest)) {
    final long total = qm.getCount(Component.class);
    long count = 0;
    boolean proceed = true;   // "continue" is a reserved word in Java, so use a different name
    while (count < total && proceed) {
        final PaginatedResult result = qm.getComponents();
        analyze(result.getList(Component.class));
        final int lastResult = result.getObjects().size();
        count += lastResult;
        proceed = lastResult > 0;   // stop once a page comes back empty
        qm.advancePagination();
        doThrottleDelay();
    }
}

Alternatively, you could omit the total and continue paging until lastResult is zero. That would also cover the case of new components being added during the analyzer run.
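A sketch of that alternative, under the same assumptions as the snippet above:

try (QueryManager qm = new QueryManager(alpineRequest)) {
    while (true) {
        final PaginatedResult result = qm.getComponents();
        if (result.getObjects().isEmpty()) {
            break;   // no more rows - also handles components added or removed mid-run
        }
        analyze(result.getList(Component.class));
        qm.advancePagination();
        doThrottleDelay();
    }
}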

@stevespringett
Member

Thanks for the analysis @bugbouncer.

One other thing I noticed while digging into this issue is that when the entire portfolio is being analyzed (every 24 hours), VulnerabilityAnalysisTask fires off individual events for each analyzer, and each analyzer performs the same paginated queries. This is obviously terribly inefficient. I'm going to move that logic out of BaseComponentAnalyzerTask and into VulnerabilityAnalysisTask, which will be responsible for distributing components to the various analyzers for analysis. It already does this in some cases, but not during portfolio analysis, so this change will also have the advantage of making the behavior of VulnerabilityAnalysisTask consistent. This should greatly reduce the number of round trips made to the database server.
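A rough sketch of the intended shape of that change; the constructor, field, and dispatch details here are assumptions for illustration, not the actual commit:

import java.util.List;

public class VulnerabilityAnalysisTask {

    // Hypothetical: the configured analyzers (internal, OSS Index, NPM Audit, VulnDB).
    private final List<BaseComponentAnalyzerTask> analyzers;

    public VulnerabilityAnalysisTask(final List<BaseComponentAnalyzerTask> analyzers) {
        this.analyzers = analyzers;
    }

    public void analyzePortfolio() {
        try (final QueryManager qm = new QueryManager()) {
            while (true) {
                // One shared pagination pass over COMPONENT...
                final PaginatedResult result = qm.getComponents();
                final List<Component> page = result.getList(Component.class);
                if (page.isEmpty()) {
                    break;   // empty page: inventory exhausted (or it shrank mid-run)
                }
                // ...with each page handed to every enabled analyzer, instead of each
                // analyzer re-running the same paginated query against the database.
                for (final BaseComponentAnalyzerTask analyzer : analyzers) {
                    analyzer.analyze(page);
                }
                qm.advancePagination();
            }
        }
    }
}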

stevespringett added a commit that referenced this issue Jun 9, 2021
…Added logic to break out of iterations when component inventory changes. This should eliminate the possibility of endless loops. Added additional logging during iterations.
@stevespringett added the defect (Something isn't working), p1 (Critical bugs that prevent DT from being used, or features that must be implemented ASAP), and pending release labels and removed the question (Further information is requested) label Jun 9, 2021
@stevespringett self-assigned this Jun 9, 2021
@stevespringett added this to the 4.3 (Project ACL Only) milestone Jun 9, 2021
@nibiwodong

nibiwodong commented Jun 10, 2021

The DTrack Jenkins plugin may keep waiting until it gets the DTrack scan processing status.
I guess maybe that's the reason for this problem.

I use dtrack in a pipeline like:
dependencyTrackPublisher artifact: 'target/bom.xml', autoCreateProjects: true, synchronous: true ..........

These are the Jenkins pipeline job logs:

[DependencyTrack] The artifact was successfully published. You may now navigate to http://127.0.0.1:9081/projects/ to view the results.

[DependencyTrack] Polling Dependency-Track for BOM processing status

ERROR: Polling Dependency-Track for results is taking longer than expected - polling limit exceeded

@davidkarlsen
Contributor

@nibiwodong take a chill-pill :)

@nibiwodong

@nibiwodong take a chill-pill :)

Thanks to BurpSuite...^_^
I've deleted the duplicate comments.

@mbenz89

mbenz89 commented Jul 12, 2021

@stevespringett We are also running into performance problems due to this issue. Are there any plans to fix this soon?

@kekkegenkai
Contributor

The fix is now available as of version 4.3.0.
I recommend using version 4.3.1, because 4.3.0 doesn't work properly with Postgres.
@stevespringett we can probably close the ticket.

Thanks to everybody who was involved in fixing this issue :).

@github-actions
Contributor

github-actions bot commented Sep 3, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions bot locked as resolved and limited conversation to collaborators Sep 3, 2021