[WIP] [DO NOT REVIEW] adding support for ceph admin-gateway #57535

Draft
wants to merge 19 commits into
base: main

rkachach
Contributor

@rkachach rkachach commented May 17, 2024

TODO

  • review default SSL/TLS configuration
  • unify cephadm root CA management
  • fix prometheus/alertmanager access when secure_monitoring_stack is enabled
  • fix the nginx image (right now we are just using the latest stable version from Docker)

This pull request introduces a new design for Ceph applications based on a modular, service-based architecture. It adds a new cephadm service, admin-gateway, based on nginx, an open-source, high-performance web server known for its scalability, efficiency, and versatility. The service acts as the new front end and single point of entry to the cluster, providing unified access to all Ceph applications, including the Dashboard and the monitoring applications. In addition, nginx enhances security and simplifies access management thanks to its robust community and high security standards.

Benefits of the new service

  • Unified Access: Centralizing access through nginx improves security and provides a single entry point for cluster management.
  • Improved user experience: Users no longer need to care where each application is running (ip/host).
  • High Availability for dashboard: nginx HA mechanisms are used to provide high availability for the ceph dashboard.
  • High Availability for monitoring: nginx HA mechanisms are used to provide high availability for the monitoring services.

High availability enhancements

The current cephadm/dashboard implementation lacks HA for monitoring services. Even though cephadm can deploy N instances of services such as grafana, prometheus or alertmanager, when configuring the dashboard (using the dashboard set-<service>-api-host API) it just picks the last configured daemon. If this daemon goes down, there is no automated fail-over to a redundant healthy instance. The following diagram reflects the current architecture (notice that the dashboard is configured to access the different monitoring services directly).

This problem is solved by using the upstream HA features provided by nginx. The proposed solution makes use of a dedicated internal server that acts as a reverse proxy for the monitoring services. The dashboard is configured to use the nginx endpoints instead of the ip/host of the monitoring daemons directly. The following diagram shows the new architecture:

As the diagram above shows, the new architecture has two servers:

  • External server: this server is responsible for handling and routing external user requests. The idea is to also use this server for any extra processing we would like to perform for external users, such as authentication, authorization, etc. This server relies on the nginx upstream feature to group the monitoring applications (by category). The HA mechanism is implemented by selecting one of the available healthy servers.

  • Internal server: this server is responsible for handling and routing internal requests only. As in the external case, this server relies on the nginx upstream feature to provide monitoring HA, this time for internal services. This server uses its own self-signed certificates to secure the communication with other internal clients.
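
The fail-over idea behind both servers can be sketched in a few lines of Python. This is only an illustration, not nginx's actual algorithm: nginx balances round-robin by default and temporarily marks failed servers as unavailable, while the sketch below only shows the ordered retry. The retryable status codes are taken from the proxy_next_upstream line of the generated config in this description; `proxy_with_failover` and `send` are hypothetical names.

```python
from typing import Callable, Iterable, Optional, Tuple

# Statuses after which the request is passed to the next backend
# (same list as the proxy_next_upstream line in the generated config).
RETRYABLE_STATUSES = {500, 502, 503, 504}


def proxy_with_failover(
    backends: Iterable[str],
    send: Callable[[str], int],
) -> Tuple[Optional[str], Optional[int]]:
    """Try each backend in order; return (backend, status) of the first usable reply.

    `send` is a hypothetical callable that performs the request and returns an
    HTTP status code, or raises OSError on a connection error or timeout.
    """
    last_status = None
    for backend in backends:
        try:
            status = send(backend)
        except OSError:
            continue  # connection error or timeout: try the next backend
        if status in RETRYABLE_STATUSES:
            last_status = status
            continue  # backend answered with an error status: try the next one
        return backend, status
    return None, last_status  # no backend produced a usable reply


# Demo: the first dashboard instance answers 503, the second one answers 200.
def fake_send(backend: str) -> int:
    replies = {"192.168.100.100:8080": 503, "192.168.100.102:8080": 200}
    return replies[backend]


print(proxy_with_failover(["192.168.100.100:8080", "192.168.100.102:8080"], fake_send))
# → ('192.168.100.102:8080', 200)
```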

Usage

cephadm:
ceph orch apply admin-gateway --placement=<your-destination-node>

Or by providing a detailed spec file (e.g. for custom certificates):

service_type: admin-gateway
placement:
  hosts:
    - ceph-node-1
spec:
  port: 9443
  ssl_protocols:
    - TLSv1.2
    - TLSv1.3
  ssl_ciphers:
    - AES128-SHA
    - AES256-SHA
    - RC4-SHA
  ssl_certificate: |
    -----BEGIN CERTIFICATE-----
    < YOUR CERT DATA HERE >
    -----END CERTIFICATE-----
  ssl_certificate_key: |
    -----BEGIN RSA PRIVATE KEY-----
    < YOUR PRIV KEY DATA HERE >
    -----END RSA PRIVATE KEY-----
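
Since the TODO list mentions reviewing the default SSL/TLS configuration, a quick name-based check of an ssl_ciphers list may help during that review. The sketch below is a rough heuristic under stated assumptions, not an authoritative audit: the `WEAK_MARKERS` tuple and the `weak_ciphers` helper are hypothetical and not part of this PR. RC4 in particular is prohibited for TLS by RFC 7465.

```python
from typing import Iterable, List

# Substrings that commonly identify outdated cipher suites (assumed list,
# chosen for illustration only).
WEAK_MARKERS = ("RC4", "DES", "MD5", "NULL", "EXPORT")


def weak_ciphers(ciphers: Iterable[str]) -> List[str]:
    """Return the cipher names that contain one of the weak markers."""
    return [c for c in ciphers if any(marker in c.upper() for marker in WEAK_MARKERS)]


# Cipher list taken from the example spec file.
print(weak_ciphers(["AES128-SHA", "AES256-SHA", "RC4-SHA"]))
# → ['RC4-SHA']
```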

Example of the generated nginx config:

[root@ceph-node-1 ~]# cat /var/lib/ceph/e027a6c4-1dc9-11ef-9960-525400ad961f/admin-gateway.ceph-node-1/etc/nginx.conf 
worker_rlimit_nofile 8192;

events {
    worker_connections 4096;  ## Default: 1024
}

http {

    upstream dashboard_servers {
        server 192.168.100.100:8080;
        server 192.168.100.102:8080;
    }

    upstream grafana_servers {
        server 192.168.100.100:3000;
        server 192.168.100.102:3000;
    }

    upstream prometheus_servers {
        server 192.168.100.100:9095;
        server 192.168.100.101:9095;
    }

    upstream alertmanager_servers {
        server 192.168.100.100:9093;
        server 192.168.100.102:9093;
    }


    server {

        listen              443 ssl;
        listen              [::]:443 ssl;
        ssl_certificate     /etc/nginx/ssl/nginx.crt;
        ssl_certificate_key /etc/nginx/ssl/nginx.key;
        ssl_protocols       TLSv1.2 TLSv1.3;
        ssl_prefer_server_ciphers on;

        location / {
            proxy_pass http://dashboard_servers;
            proxy_next_upstream error timeout invalid_header http_500 http_502 http_503 http_504;
        }

        location /grafana {
            rewrite ^/grafana/(.*) /$1 break;
            proxy_pass https://grafana_servers;
        }

        location /prometheus {
            proxy_pass http://prometheus_servers;
        }

        location /alertmanager {
            proxy_pass http://alertmanager_servers;
        }

    }

    server {

        listen              29443 ssl;
        listen              [::]:29443 ssl;
        ssl_certificate     /etc/nginx/ssl/nginx_internal.crt;
        ssl_certificate_key /etc/nginx/ssl/nginx_internal.key;
        ssl_protocols       TLSv1.2 TLSv1.3;
        ssl_ciphers         AES128-SHA:AES256-SHA:RC4-SHA:DES-CBC3-SHA:RC4-MD5;
        ssl_prefer_server_ciphers on;

        location /internal/grafana {
            rewrite ^/internal/grafana/(.*) /$1 break;
            proxy_pass https://grafana_servers;
        }

        location /internal/prometheus {
            rewrite ^/internal/prometheus/(.*) /prometheus/$1 break;
            proxy_pass http://prometheus_servers;
        }

        location /internal/alertmanager {
            rewrite ^/internal/alertmanager/(.*) /alertmanager/$1 break;
            proxy_pass http://alertmanager_servers;
        }

    }
}
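
To make the effect of the rewrite rules in the internal server easier to follow, here is a small Python sketch that applies the same three regular expressions to a request path. It only mimics the path substitution: query strings and nginx's URI normalization are ignored. `REWRITES` and `rewrite_uri` are hypothetical names used for illustration.

```python
import re

# (pattern, replacement) pairs copied from the internal server block,
# with nginx's $1 written as Python's \1.
REWRITES = [
    (r"^/internal/grafana/(.*)", r"/\1"),
    (r"^/internal/prometheus/(.*)", r"/prometheus/\1"),
    (r"^/internal/alertmanager/(.*)", r"/alertmanager/\1"),
]


def rewrite_uri(uri: str) -> str:
    """Return the path that would be sent to the upstream for a given request path."""
    for pattern, replacement in REWRITES:
        match = re.search(pattern, uri)
        if match:
            return match.expand(replacement)
    return uri  # no rule matched: the path is passed through unchanged


print(rewrite_uri("/internal/prometheus/api/v1/query"))  # → /prometheus/api/v1/query
print(rewrite_uri("/internal/grafana/d/abc"))            # → /d/abc
print(rewrite_uri("/internal/grafana"))                  # → /internal/grafana
```

Note that a path without a trailing slash after the service name (the last case) does not match the pattern and is forwarded as-is.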

Contribution Guidelines

  • To sign and title your commits, please refer to Submitting Patches to Ceph.

  • If you are submitting a fix for a stable branch (e.g. "quincy"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.

  • When filling out the below checklist, you may click boxes directly in the GitHub web UI. When entering or editing the entire PR message in the GitHub web UI editor, you may also select a checklist item by adding an x between the brackets: [x]. Spaces and capitalization matter when checking off items this way.

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Show available Jenkins commands
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows
  • jenkins test rook e2e

@phlogistonjohn
Contributor

phlogistonjohn commented May 17, 2024

I will not review the patch but I will take this opportunity to bike-shed the name a little bit: cluster-gateway doesn't make it clear that this is mainly for stuff like the dashboard and grafana, etc. ui-gateway might be clearer but too narrow. How about admin-gateway? Or management-gateway?

@rkachach
Contributor Author

I will not review the patch but I will take this opportunity to bike-shed the name a little bit: cluster-gateway doesn't make it clear that this is mainly for stuff like the dashboard and grafana, etc. ui-gateway might be clearer but too narrow. How about admin-gateway? Or management-gateway?

@phlogistonjohn no worries, and I think it's a good time to discuss the service name :)

The rev-proxy covers the monitoring stack (prometheus, alertmanager, grafana, ..). In addition, it can include any service app that we would run for cluster mgmt in the future. The ideal name I think could be "ingress" (to be aligned with k8s) but it's already in use as you know. I'm OK with going with another name as long as it better describes the purpose of the service 👍

@rkachach rkachach force-pushed the fix_issue_66095 branch 2 times, most recently from ee4f57f to cab9d77 Compare May 21, 2024 08:44
@rkachach rkachach changed the title [WIP] [DO NOT REVIEW] adding new cephadm service cluster-gateway [WIP] [DO NOT REVIEW] adding support for ceph admin-gateway May 21, 2024
@rkachach rkachach force-pushed the fix_issue_66095 branch 5 times, most recently from 0495dd3 to 9a01460 Compare May 28, 2024 11:13
@rkachach rkachach force-pushed the fix_issue_66095 branch 4 times, most recently from 6cd2f1f to 3f7b6f0 Compare May 31, 2024 08:01
@github-actions github-actions bot added the mgr label May 31, 2024
@rkachach rkachach force-pushed the fix_issue_66095 branch 2 times, most recently from 92b71e3 to 1a9436e Compare June 3, 2024 08:38
Fixes: https://tracker.ceph.com/issues/66095

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
adding support for dynamic re-configuration. admin-gateway should be
reconfigured automatically anytime there's a change on alertmanager,
grafana or prometheus since url_prefix of these services depends on
the presence (or not) of the admin-gateway. Similarly, these services
must be reconfigured whenever the admin-gateway is added or removed.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
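
One way to picture the mutual dependency described in this commit message is to compare a dependency list before and after a change and reconfigure when the two differ. The sketch below is an assumption-based illustration, not the code in this PR: the function names and the daemon names are hypothetical.

```python
from typing import Iterable, List


def gateway_deps(monitoring_daemons: Iterable[str]) -> List[str]:
    """Dependencies of the gateway: every running monitoring daemon."""
    return sorted(monitoring_daemons)


def monitoring_deps(gateway_daemons: Iterable[str]) -> List[str]:
    """Dependencies of a monitoring service: the gateway daemon, if one exists."""
    return sorted(gateway_daemons)


def needs_reconfig(last_deps: Iterable[str], current_deps: Iterable[str]) -> bool:
    """A daemon is reconfigured whenever its dependency list has changed."""
    return sorted(last_deps) != sorted(current_deps)


before = gateway_deps(["grafana.ceph-node-1", "prometheus.ceph-node-1"])
after = gateway_deps(["grafana.ceph-node-1", "prometheus.ceph-node-1", "prometheus.ceph-node-2"])

# A new prometheus daemon appeared, so the gateway must be reconfigured.
print(needs_reconfig(before, after))  # → True

# The gateway was added, so the monitoring services must be reconfigured too.
print(needs_reconfig(monitoring_deps([]), monitoring_deps(["admin-gateway.ceph-node-1"])))  # → True
```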
there should be only one instance of the admin-gateway service

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
adding support to populate nginx configuration automatically by adding
all the currently active mgrs. Nginx redirection mechanism is used to
choose automatically the active mgr instance. This way, we redirect
the user to the right instance in case of mgr failover.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
so far, we've been using only the first instance of each
monitoring service (e.g., alertmanager, prometheus, etc)
to configure nginx locations. In real deployments, multiple
instances of each service may be active simultaneously.
This change uses nginx's 'upstream' feature to configure
all running instances as backends. This allows nginx to
automatically choose a healthy instance to process
incoming requests.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
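
As a rough illustration of the change described in this commit message, the following Python sketch renders one upstream block per service from a mapping of service names to endpoints. It uses plain string formatting and is not the template code of this PR; `render_upstreams` is a hypothetical helper, and the addresses are the ones from the example config in the description.

```python
from typing import Dict, List


def render_upstreams(endpoints: Dict[str, List[str]], indent: str = "    ") -> str:
    """Render an nginx upstream block for each service, listing all its instances."""
    blocks = []
    for service, servers in endpoints.items():
        lines = [f"{indent}upstream {service}_servers {{"]
        lines += [f"{indent}{indent}server {server};" for server in servers]
        lines.append(f"{indent}}}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


print(render_upstreams({
    "prometheus": ["192.168.100.100:9095", "192.168.100.101:9095"],
    "alertmanager": ["192.168.100.100:9093", "192.168.100.102:9093"],
}))
```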
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
so far the implementation was relying on a complex redirection
mechanism. The new mechanism makes use of the nginx backends feature to
define a set of available dashboard servers. This way nginx can
automatically pick the active server.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
when the admin-gateway is removed we have to restore the dashboard
default configuration for standby behavior

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
certificates (and private key) must not be generated when https is
disabled. Additionally, grafana protocol must be the same as the
rev-proxy.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
let's use 80 and 443 as default ports. Users can customize the port by
using the spec.port option.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
@rkachach rkachach force-pushed the fix_issue_66095 branch 2 times, most recently from 214e7ae to e26b4ad Compare June 4, 2024 08:18
Signed-off-by: Redouane Kachach <rkachach@ibm.com>
secure_monitoring_stack dependency is added so that whenever the value of
this configuration variable changes, nginx is reconfigured to use
the corresponding protocol.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
@rkachach
Contributor Author

rkachach commented Jun 4, 2024

jenkins retest this please

so far we have been using only one file with all the
configuration. This has the benefit of maintaining everything within
the same file but it can be really complex when more advanced
configuration is added for authentication for example. The new
approach splits the configuration into three files: main, external and
internal server configuration for better maintainability.

Signed-off-by: Redouane Kachach <rkachach@ibm.com>
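
A minimal sketch of what such a three-file layout could look like, assuming the main file pulls in the two server files with nginx's include directive. The file names and paths below are hypothetical, as is the `split_config` helper; the actual names used by this PR are not shown on this page.

```python
from typing import Dict


def split_config(upstreams: str, external_server: str, internal_server: str) -> Dict[str, str]:
    """Return the three config files as a mapping of (hypothetical) file name to content."""
    main = "\n".join([
        "events {",
        "    worker_connections 4096;",
        "}",
        "",
        "http {",
        upstreams,
        # The two server blocks live in their own files and are included here.
        "    include /etc/nginx/external_server.conf;",
        "    include /etc/nginx/internal_server.conf;",
        "}",
    ])
    return {
        "nginx.conf": main,
        "external_server.conf": external_server,
        "internal_server.conf": internal_server,
    }


files = split_config(
    upstreams="    upstream dashboard_servers {\n        server 192.168.100.100:8080;\n    }",
    external_server="server {\n    listen 443 ssl;\n}",
    internal_server="server {\n    listen 29443 ssl;\n}",
)
for name, content in files.items():
    print(f"# {name}\n{content}\n")
```

Keeping the server blocks in separate files means that, for example, authentication settings for the external server can grow without touching the internal one.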