Refactor to support DT clusters for high availability (HA) and high performance #903

Open
stevespringett opened this issue Jan 25, 2021 · 10 comments

@stevespringett
Member

Current Behavior:

In DT 3.8, the frontend was separated from the server. In v4.0 it was further decoupled and the UI was completely removed from the server by default. The server was rebranded as the API Server, with the intent that other server-side components would be available in the future.

The current architecture of the API Server is monolithic and relies on an async, event-driven queue and task execution subsystem. Under heavy load, the system can underperform and, in some situations, bouncing the app is required.

Proposed Behavior:

Decouple the various types of workers into their own projects that can be deployed and scaled independently. A microservice architecture is not the appropriate approach for DT, but an architecture that incorporates the following will likely be ideal:

  • Frontend
  • API Server
  • Distributed event and task execution queue
  • Specialized worker nodes

SPIKE

  • Investigate the feasibility of using Redis and Redisson (a rough sketch follows this list)
  • Ideally, have Redis deployed by default, or the option to specify an external Redis instance
  • Experiment with decoupled persistence and model libraries
  • Experiment with CLI worker nodes that respond to events
  • Experiment with cluster-wide singletons where there should only be one instance running (e.g. NVD and NPM mirroring)
  • Experiment with secure key replication across the cluster, or the use of Docker secrets (or K8s Secrets).
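
A rough sketch of the first and fifth bullets above, assuming Redis is reachable at redis://redis:6379 and using hypothetical queue and lock names (dtrack-events, nvd-mirror). This is not existing Dependency-Track code, only an illustration of how Redisson could back a distributed event queue and a cluster-wide singleton task:

import java.util.concurrent.TimeUnit;

import org.redisson.Redisson;
import org.redisson.api.RBlockingQueue;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class ClusterExperiment {

    public static void main(String[] args) throws InterruptedException {
        // Connect to a Redis instance shared by all nodes in the cluster.
        Config config = new Config();
        config.useSingleServer().setAddress("redis://redis:6379");
        RedissonClient redisson = Redisson.create(config);

        // Distributed event queue: any worker node can take events published
        // by the API Server and process them independently.
        RBlockingQueue<String> events = redisson.getBlockingQueue("dtrack-events");
        String event = events.take();
        System.out.println("Processing event: " + event);

        // Cluster-wide singleton: only the node that acquires the lock runs
        // the mirroring task; the 30-minute lease expires automatically if the node dies.
        RLock lock = redisson.getLock("nvd-mirror");
        if (lock.tryLock(0, 30, TimeUnit.MINUTES)) {
            try {
                System.out.println("This node performs NVD mirroring");
            } finally {
                lock.unlock();
            }
        }

        redisson.shutdown();
    }
}
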
@stevespringett stevespringett added the enhancement New feature or request label Jan 25, 2021
@stevespringett stevespringett added this to the 5.0 milestone Jan 25, 2021
@stevespringett stevespringett self-assigned this Jan 25, 2021
@stevespringett
Member Author

See also: #218

@lihaoran93

Is there any other temporary way to solve this? The UpdatePortfolioMetrics task has been running for eight hours with 1,100 items (projects).

@stevespringett
Member Author

@lihaoran93 If running on VMs or Docker, the likely culprits are underpowered machines. Make sure you're using machines optimized for CPU and RAM and that you've given enough of both to the server. You'll also want to look at your database server, especially if it's on a VM or using something like RDS; these can be underpowered as well. 1,100 projects isn't that many, so the fact it's taking that long leads me to believe there's a performance bottleneck somewhere on the hosts.

@lihaoran93

Thanks, I'll check the database and CPU.

@spmishra121

Hi @stevespringett, are there any clear steps defined for implementing DT clusters for HA?

@spmishra121

Can we go with the steps below for the currently available version?

  1. Create two MySQL databases in a cluster, as primary and secondary nodes.
  2. Use the primary node's database in the docker-compose.yml file (a rough fragment follows this list).
  3. Install two instances of DT on different machines.
  4. Configure both databases to sync.
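
A minimal, hypothetical fragment of what step 2 could look like, assuming a MySQL primary reachable at the hostname mysql-primary and the same ALPINE_DATABASE_* variables used elsewhere in this thread. Pointing the API Server at the primary node is straightforward; running more than one API Server instance against it is exactly what this issue is meant to make possible, so treat this as a sketch, not a supported setup:

  dtrack-apiserver:
    image: dependencytrack/apiserver
    environment:
      - ALPINE_DATABASE_MODE=external
      # "mysql-primary" is a hypothetical hostname for the primary node
      - ALPINE_DATABASE_URL=jdbc:mysql://mysql-primary:3306/dtrack?autoReconnect=true&useSSL=false
      - ALPINE_DATABASE_DRIVER=com.mysql.cj.jdbc.Driver
      - ALPINE_DATABASE_USERNAME=dtrack
      - ALPINE_DATABASE_PASSWORD=dtrack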

@fuentecilla86

Hi,

I am playing with HA locally (not with the database yet), but I have a problem with two DT servers running. This is my docker-compose.yml:

version: '3.7'

volumes:
  dependency-track:

services:
  dtrack-apiserver:
    image: dependencytrack/apiserver:4.3.6
    environment:
    # Database Properties
    - ALPINE_DATABASE_MODE=external
    - ALPINE_DATABASE_URL=jdbc:postgresql://db:5432/dtrack
    - ALPINE_DATABASE_DRIVER=org.postgresql.Driver
    - ALPINE_DATABASE_USERNAME=dtrack
    - ALPINE_DATABASE_PASSWORD=dtrack
    - ALPINE_DATABASE_POOL_ENABLED=true
    - ALPINE_DATABASE_POOL_MAX_SIZE=20
    - ALPINE_DATABASE_POOL_MIN_IDLE=10
    - ALPINE_DATABASE_POOL_IDLE_TIMEOUT=300000
    - ALPINE_DATABASE_POOL_MAX_LIFETIME=600000
    depends_on:
      - db
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 12288m
        reservations:
          memory: 8192m
      restart_policy:
        condition: on-failure
    # ports:
    #   - '8081:8080'
    volumes:
      - 'dependency-track:/data'
    # restart: unless-stopped
    restart: on-failure

  dtrack-frontend:
    image: dependencytrack/frontend:4.3.1
    depends_on:
      - dtrack-apiserver
    environment:
      - API_BASE_URL=http://localhost:8081
    ports:
      - "8080:8080"
    restart: unless-stopped

  db:
    image: postgres:14.2
    expose:
      - "5432"
    environment:
      - POSTGRES_USER=dtrack
      - POSTGRES_PASSWORD=dtrack
      - POSTGRES_DB=dtrack
    volumes:
      - ./docker/postgresql:/var/lib/postgresql

  nginx:
    image: nginx:latest
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - dtrack-apiserver
    ports:
      - "8081:8081"

nginx.conf

user  nginx;

events {
    worker_connections   1000;
}
http {
    server {
        listen 8081;

        location / {
            proxy_pass http://dtrack-apiserver:8080;
        }
    }
}

The problem is that when I run it, the two servers both try to write to the same /data folder and the second one crashes because of it. Is there any way to control this? Is there something I'm not paying attention to?

@LazyAnnoyingStupidIdiot

So, a few questions.

  • If it has issues with concurrent access to the data on disk, can I just start one instance and add the second instance later, after it has finished initialising?
  • Or, can I run this in an AWS Fargate container with a Postgres DB backing? I'm assuming everything in the data directory can be regenerated when the API container is replaced?

@nscuro
Member

nscuro commented Feb 23, 2023

@LazyAnnoyingStupidIdiot

If it has issues with concurrent access to the data on disk, can I just start one instance and add the second instance later, after it has finished initialising?

That will not work, because some data that is initialized immediately on startup will also be periodically refreshed / updated afterwards. Lucene search indexes, for example (located in /data/index), are updated frequently throughout the application's lifetime.

Or, can I run this in an AWS Fargate container with a Postgres DB backing? I'm assuming everything in the data directory can be regenerated when the API container is replaced?

The /data directory contains keys for secrets encryption (secret.key), as well as for JWT signing / validation (public.key, private.key). While those can be re-generated, doing so will invalidate all previously issued JWTs and requires re-encryption of secrets, like the API keys for OSS Index, GitHub, Snyk, etc.
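
One possible interim sketch, assuming the two API Server instances are defined explicitly instead of via replicas, so that each gets its own /data volume and the Lucene indexes no longer collide. Note that each instance would then generate its own secret.key, public.key, and private.key, so those key files would still have to be kept in sync by hand for JWTs and encrypted secrets to work across both instances:

services:
  dtrack-apiserver-1:
    image: dependencytrack/apiserver:4.3.6
    volumes:
      - 'dependency-track-1:/data'

  dtrack-apiserver-2:
    image: dependencytrack/apiserver:4.3.6
    volumes:
      - 'dependency-track-2:/data'

volumes:
  dependency-track-1:
  dependency-track-2: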

@LazyAnnoyingStupidIdiot

@nscuro thank you for the answers. Very much appreciated.

I see you have mentioned the Lucene search index. That means NAS (EFS on AWS) would not be too great an idea either?

I'm really hoping for a setup where I don't have to use an EC2 instance and its disk storage, but from the look of things this is unavoidable :/
