Refactor to support DT clusters for high availability (HA) and high performance #903

Open
stevespringett opened this issue Jan 25, 2021 · 10 comments

@stevespringett
Member

Current Behavior:

In DT 3.8, the frontend was separated from the server. In v4.0 it was further decoupled and the UI was completely removed from the server by default. The server was rebranded as the API Server, with the intent that other server-side components would be available in the future.

The current architecture of the API Server is monolithic and relies on an async, event-driven queue and task execution subsystem. Under heavy load, the system can underperform and, in some situations, bouncing the app is required.

Proposed Behavior:

Decouple the various types of workers into their own projects that can be deployed and scaled independently. A microservice architecture is not the appropriate approach for DT, but an architecture that incorporates the following will likely be ideal:

  • Frontend
  • API Server
  • Distributed event and task execution queue
  • Specialized worker nodes

SPIKE

  • Investigate the feasibility of using Redis and Redisson (a rough sketch follows this list)
  • Ideally, have Redis deployed by default, or the option to specify an external Redis instance
  • Experiment with decoupled persistence and model libraries
  • Experiment with CLI worker nodes that respond to events
  • Experiment with cluster-wide singletons where there should only be one instance running (e.g. NVD and NPM mirroring)
  • Experiment with secure key replication across the cluster, or the use of Docker secrets (or K8s Secrets).
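
A rough sketch of the first and fifth bullets above, assuming Redis is reachable at redis://redis:6379 and using hypothetical queue and lock names (dtrack-events, nvd-mirror). This is not existing Dependency-Track code, only an illustration of how Redisson could back a distributed event queue and a cluster-wide singleton task:

import java.util.concurrent.TimeUnit;

import org.redisson.Redisson;
import org.redisson.api.RBlockingQueue;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class ClusterExperiment {

    public static void main(String[] args) throws InterruptedException {
        // Connect to a Redis instance shared by all nodes in the cluster.
        Config config = new Config();
        config.useSingleServer().setAddress("redis://redis:6379");
        RedissonClient redisson = Redisson.create(config);

        // Distributed event queue: any worker node can take events published
        // by the API Server and process them independently.
        RBlockingQueue<String> events = redisson.getBlockingQueue("dtrack-events");
        String event = events.take();
        System.out.println("Processing event: " + event);

        // Cluster-wide singleton: only the node that acquires the lock runs
        // the mirroring task; the 30-minute lease expires automatically if the node dies.
        RLock lock = redisson.getLock("nvd-mirror");
        if (lock.tryLock(0, 30, TimeUnit.MINUTES)) {
            try {
                System.out.println("This node performs NVD mirroring");
            } finally {
                lock.unlock();
            }
        }

        redisson.shutdown();
    }
}
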
@stevespringett stevespringett added the enhancement New feature or request label Jan 25, 2021
@stevespringett stevespringett added this to the 5.0 milestone Jan 25, 2021
@stevespringett stevespringett self-assigned this Jan 25, 2021
@stevespringett
Member Author

See also: #218

@lihaoran93

Is there any other temporary way to solve this? The UpdatePortfolioMetrics task has been running for eight hours with 1,100 items (projects).

@stevespringett
Member Author

@lihaoran93 If running on VMs or Docker, the likely culprits are underpowered machines. Make sure you're using machines optimized for CPU and RAM and that you've given enough of both to the server. You'll also want to look at your database server, especially if it's on a VM or using something like RDS; these can be underpowered as well. 1,100 projects isn't that many, so the fact it's taking that long leads me to believe there's a performance bottleneck somewhere on the hosts.

@lihaoran93

Thanks, I'll check the database and CPU.

@spmishra121

Hi @stevespringett, are there any clear steps defined for implementing DT clusters for HA?

@spmishra121

Can we go with the steps below for the currently available version?

  1. Create two MySQL databases in a cluster, as primary and secondary nodes.
  2. Use the primary node's database in the docker-compose.yml file (a rough fragment follows this list).
  3. Install two instances of DT on different machines.
  4. Configure both databases to sync.
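
A minimal, hypothetical fragment of what step 2 could look like, assuming a MySQL primary reachable at the hostname mysql-primary and the same ALPINE_DATABASE_* variables used elsewhere in this thread. Pointing the API Server at the primary node is straightforward; running more than one API Server instance against it is exactly what this issue is meant to make possible, so treat this as a sketch, not a supported setup:

  dtrack-apiserver:
    image: dependencytrack/apiserver
    environment:
      - ALPINE_DATABASE_MODE=external
      # "mysql-primary" is a hypothetical hostname for the primary node
      - ALPINE_DATABASE_URL=jdbc:mysql://mysql-primary:3306/dtrack?autoReconnect=true&useSSL=false
      - ALPINE_DATABASE_DRIVER=com.mysql.cj.jdbc.Driver
      - ALPINE_DATABASE_USERNAME=dtrack
      - ALPINE_DATABASE_PASSWORD=dtrack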

@fuentecilla86

Hi,

I am playing with HA locally (not with the database yet), but I have a problem with two DT servers running. This is my docker-compose.yml:

version: '3.7'

volumes:
  dependency-track:

services:
  dtrack-apiserver:
    image: dependencytrack/apiserver:4.3.6
    environment:
    # Database Properties
    - ALPINE_DATABASE_MODE=external
    - ALPINE_DATABASE_URL=jdbc:postgresql://db:5432/dtrack
    - ALPINE_DATABASE_DRIVER=org.postgresql.Driver
    - ALPINE_DATABASE_USERNAME=dtrack
    - ALPINE_DATABASE_PASSWORD=dtrack
    - ALPINE_DATABASE_POOL_ENABLED=true
    - ALPINE_DATABASE_POOL_MAX_SIZE=20
    - ALPINE_DATABASE_POOL_MIN_IDLE=10
    - ALPINE_DATABASE_POOL_IDLE_TIMEOUT=300000
    - ALPINE_DATABASE_POOL_MAX_LIFETIME=600000
    depends_on:
      - db
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 12288m
        reservations:
          memory: 8192m
      restart_policy:
        condition: on-failure
    # ports:
    #   - '8081:8080'
    volumes:
      - 'dependency-track:/data'
    # restart: unless-stopped
    restart: on-failure

  dtrack-frontend:
    image: dependencytrack/frontend:4.3.1
    depends_on:
      - dtrack-apiserver
    environment:
      - API_BASE_URL=http://localhost:8081
    ports:
      - "8080:8080"
    restart: unless-stopped

  db:
    image: postgres:14.2
    expose:
      - "5432"
    environment:
      - POSTGRES_USER=dtrack
      - POSTGRES_PASSWORD=dtrack
      - POSTGRES_DB=dtrack
    volumes:
      - ./docker/postgresql:/var/lib/postgresql

  nginx:
    image: nginx:latest
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - dtrack-apiserver
    ports:
      - "8081:8081"

nginx.conf

user  nginx;

events {
    worker_connections   1000;
}
http {
    server {
        listen 8081;

        location / {
            proxy_pass http://dtrack-apiserver:8080;
        }
    }
}

The problem is that when I run it, the two servers both try to write to the same /data folder and the second one crashes because of it. Is there any way to control this? Is there something I'm not paying attention to?

@LazyAnnoyingStupidIdiot

So, a few questions.

  • If it has issues with concurrent access to the data on disk, can I just start one instance and add the second instance later, after it has finished initialising?
  • Or, can I run this in an AWS Fargate container with a Postgres DB backing? I'm assuming everything in the data directory can be regenerated when the API container is replaced?

@nscuro
Member

nscuro commented Feb 23, 2023

@LazyAnnoyingStupidIdiot

If it has issues with concurrent access to the data on disk, can I just start one instance and add the second instance later, after it has finished initialising?

That will not work, because some data that is initialized immediately on startup will also be periodically refreshed / updated afterwards. Lucene search indexes, for example (located in /data/index), are updated frequently throughout the application's lifetime.

Or, can I run this in an AWS Fargate container with a Postgres DB backing? I'm assuming everything in the data directory can be regenerated when the API container is replaced?

The /data directory contains keys for secrets encryption (secret.key), as well as for JWT signing / validation (public.key, private.key). While those can be re-generated, doing so will invalidate all previously issued JWTs and requires re-encryption of secrets, like the API keys for OSS Index, GitHub, Snyk, etc.
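
One possible interim sketch, assuming the two API Server instances are defined explicitly instead of via replicas, so that each gets its own /data volume and the Lucene indexes no longer collide. Note that each instance would then generate its own secret.key, public.key, and private.key, so those key files would still have to be kept in sync by hand for JWTs and encrypted secrets to work across both instances:

services:
  dtrack-apiserver-1:
    image: dependencytrack/apiserver:4.3.6
    volumes:
      - 'dependency-track-1:/data'

  dtrack-apiserver-2:
    image: dependencytrack/apiserver:4.3.6
    volumes:
      - 'dependency-track-2:/data'

volumes:
  dependency-track-1:
  dependency-track-2: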

@LazyAnnoyingStupidIdiot

@nscuro thank you for the answers. Very much appreciated.

I see you have mentioned the Lucene search index. That means NAS (EFS on AWS) would not be too great an idea either?

I'm really hoping for a setup where I don't have to use an EC2 instance and its disk storage, but from the look of things this is unavoidable :/
