Monitoring AIStore Cluster

Monitoring the performance and health of AIStore is essential for maintaining the efficiency and reliability of the system. This guide provides detailed instructions on how to monitor AIStore using both command-line tools and a Kubernetes-based monitoring stack.

Monitoring - Using CLI

AIStore provides a CLI (command-line interface) with a show performance command. This command offers a snapshot of the cluster's performance, including throughput, latencies, disk IO, capacity, and more.
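As a minimal sketch, the wrapper below runs the command only when the `ais` binary is on the PATH. The function name `show_ais_performance` is hypothetical, and the sketch assumes the CLI is already configured to reach your cluster.

```shell
# Hypothetical helper: run `ais show performance` if the CLI is installed.
show_ais_performance() {
    if ! command -v ais >/dev/null 2>&1; then
        echo "ais CLI not found in PATH" >&2
        return 127
    fi
    # Extra arguments are passed through unchanged.
    ais show performance "$@"
}
```

Call `show_ais_performance` for the snapshot, or `show_ais_performance --help` to see which views your CLI version offers.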

Monitoring - Using kube-prometheus-stack

You can set up your own Kubernetes stack for monitoring. For a comprehensive setup, we recommend the kube-prometheus-stack Helm chart. This chart installs and integrates several components, including Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics.

This setup forms a foundational monitoring stack that can be extended as needed.

Node Labeling for Monitoring

Identify nodes designated for monitoring.

kubectl get nodes

Note: In larger deployments, label only the nodes allocated for monitoring.

Label these nodes accordingly:

kubectl label node/node-01 'aistore.nvidia.com/role_monitoring=true'
...
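To confirm the labels took effect, the nodes can be listed with a label selector. This is a small sketch: the function name `monitoring_nodes` is hypothetical, and the default selector is simply the label used above.

```shell
# Hypothetical helper: list nodes carrying the monitoring label.
# The default selector is the label applied above; pass another selector to override it.
monitoring_nodes() {
    selector="${1:-aistore.nvidia.com/role_monitoring=true}"
    kubectl get nodes -l "$selector"
}
```

Running `monitoring_nodes` should print only the nodes you labeled; an empty result suggests the label key or value does not match.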

Creating a Monitoring Namespace

This namespace will house all monitoring-related pods and services.

kubectl create ns monitoring

Deploy kube-prometheus-stack

Ensure helm is installed. If not, follow the installation guide.

  1. Install Helm using the provided script:

    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    chmod 700 get_helm.sh
    ./get_helm.sh
  2. Add the prometheus-community repo to Helm:

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
  3. Customize the chart values by editing kube_prometheus_stack_values.yaml. This involves setting nodeAffinity, grafanaAdminPassword, persistent stats storage (commented), and securityContext.

  4. Customize the alert receivers in the kube_prometheus_stack_values.yaml file. AlertManager supports various receivers, and you can configure them as needed. Refer to the Prometheus Alerting Configuration for details on each receiver's config. A sample Slack receiver configuration is included and commented out in the file.

  5. To set the securityContext, specify the details of a non-root user (typically UID >= 1000). To list existing non-root users, use the following command:

    awk -F: '$3 >= 1000 {print $1}' /etc/passwd

    You can use one of these existing non-root users or create a new one. To obtain the UID and Group ID (GID) of a user, execute:

    id [username]

    Then, update the kube_prometheus_stack_values.yaml file with the user's UID and GID by setting the runAsUser and runAsGroup fields, respectively, under securityContext. Also, don't forget to set the grafanaAdminPassword.

Important: If your monitoring nodes are labeled differently, adjust the key and value in the nodeAffinity configuration within the same file to match your custom label. The default setting is aistore.nvidia.com/role_monitoring=true.
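The securityContext step can be sketched as a small helper that turns the output of `id` into the two fields named above. The function name `security_context_for` is hypothetical, and where exactly the fragment belongs inside kube_prometheus_stack_values.yaml is an assumption; compare it with the securityContext entries already present in that file before pasting.

```shell
# Hypothetical helper: print a securityContext fragment for a user.
# With no argument it uses the current user.
security_context_for() {
    if [ -n "${1:-}" ]; then
        uid="$(id -u "$1")" || return 1
        gid="$(id -g "$1")" || return 1
    else
        uid="$(id -u)"
        gid="$(id -g)"
    fi
    # The guide asks for a non-root user, so flag UID 0 on stderr.
    if [ "$uid" -eq 0 ]; then
        echo "warning: UID 0 is root; choose a non-root user" >&2
    fi
    printf 'securityContext:\n  runAsUser: %s\n  runAsGroup: %s\n' "$uid" "$gid"
}
```

For example, `security_context_for [username]` prints the fragment for that user, and the warning goes to stderr so the YAML on stdout stays clean.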

Chart Deployment

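Deploy the Prometheus stack with the customized values. The command below reads the values file straight from the repository; if you edited a local copy in the previous steps, pass its path to -f instead: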

helm install -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/kube_prometheus_stack_values.yaml kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring
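After the install, it can help to confirm that the release deployed and its pods are starting. The sketch below assumes the release name and namespace from the command above; the function name `check_monitoring_stack` is hypothetical.

```shell
# Hypothetical helper: show the Helm release status, then the pods in the namespace.
# Defaults match the release name and namespace used in the install command above.
check_monitoring_stack() {
    release="${1:-kube-prometheus-stack}"
    namespace="${2:-monitoring}"
    helm status "$release" --namespace "$namespace" &&
        kubectl get pods --namespace "$namespace"
}
```

Pods stuck in Pending are often a sign that the nodeAffinity label does not match any node.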

Configuring Prometheus (Pod) Monitors

At this point, you'll have a Prometheus instance running that mostly just monitors itself.

To monitor AIS, we'll need to add a couple of PodMonitor definitions.

You can find two PodMonitor definitions in the file ais_podmonitors.yaml. Apply them:

kubectl -n monitoring apply -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/ais_podmonitors.yaml

When applied, the monitors configure Prometheus to scrape metrics from AIStore's proxy and target pods individually every 30 seconds.
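To illustrate the general shape of such a definition, the sketch below prints a single PodMonitor with a 30-second scrape interval. It is not a copy of ais_podmonitors.yaml: the resource name, the AIStore namespace, the pod label, and the port name are all placeholders, and the function name `print_podmonitor_sketch` is hypothetical.

```shell
# Hypothetical sketch: print a PodMonitor manifest to stdout.
# The resource name, namespace, label selector, and port name are placeholders,
# not the values used in ais_podmonitors.yaml.
print_podmonitor_sketch() {
    cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: example-ais-proxy        # placeholder name
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames:
      - ais                      # placeholder: namespace of the AIStore pods
  selector:
    matchLabels:
      component: proxy           # placeholder: label on the proxy pods
  podMetricsEndpoints:
    - port: metrics              # placeholder: named container port
      path: /metrics
      interval: 30s              # scrape interval described above
EOF
}
```

The file in the repository defines two monitors, one for proxies and one for targets; after applying it, `kubectl get podmonitors -n monitoring` should list both.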

Accessing Prometheus UI

The UI is not directly accessible from outside the cluster. Options include changing the service type to NodePort or using port-forwarding:

# change `type: ClusterIP` to `type: NodePort`
kubectl patch svc kube-prometheus-stack-prometheus -n monitoring -p '{"spec": {"type": "NodePort"}}'

# or use port-forwarding:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090

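Access the UI via http://localhost:<port>, where <port> is the forwarded local port (9090 above) or, when browsing from a cluster node, the assigned NodePort. To find the NodePort assigned to the service: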

kubectl get svc kube-prometheus-stack-prometheus -n monitoring

Note: Depending on your network setup, you may need to open an SSH tunnel to the machine with ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name> and then browse to localhost:<port> on your local machine.
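If you switched the service to NodePort, the assigned port can also be extracted directly instead of reading it from the table output. This is a sketch using kubectl's JSONPath output; the function name `svc_node_port` is hypothetical, and it assumes the service port number you pass (9090 for Prometheus, as in the port-forward command above) is the one exposed by the service.

```shell
# Hypothetical helper: print the NodePort mapped to a given service port.
# Prints nothing if the service is still of type ClusterIP.
svc_node_port() {
    svc="$1"
    port="$2"
    namespace="${3:-monitoring}"
    kubectl get svc "$svc" --namespace "$namespace" \
        -o jsonpath="{.spec.ports[?(@.port==${port})].nodePort}"
}
```

For example, `svc_node_port kube-prometheus-stack-prometheus 9090` prints the port to use in place of `<port>`.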

Prometheus UI

Setting Up Grafana Dashboard

The Grafana deployment in kube-prometheus-stack uses the kiwigrid k8s sidecar image, which lets us provide our own dashboards as Kubernetes ConfigMaps.

A sample dashboard can be found at aistore_dashboard.yaml. Apply it:

kubectl apply -n monitoring -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/aistore_dashboard.yaml
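The sidecar discovers dashboards by watching for ConfigMaps that carry a particular label. The sketch below lists such ConfigMaps; it assumes the label key is `grafana_dashboard`, which should be checked against your chart values and the labels in aistore_dashboard.yaml. The function name `dashboard_configmaps` is hypothetical.

```shell
# Hypothetical helper: list ConfigMaps that carry the dashboard label key.
# `grafana_dashboard` is assumed to be the label the sidecar watches.
dashboard_configmaps() {
    label="${1:-grafana_dashboard}"
    kubectl get configmaps --namespace monitoring -l "$label"
}
```

If the AIStore dashboard ConfigMap does not appear in the output, the sidecar will most likely not load it either.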

As with Prometheus, to access the Grafana dashboard you will have to either change the service type to NodePort or use port-forwarding:

# change `type: ClusterIP` to `type: NodePort`
kubectl patch svc kube-prometheus-stack-grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'

# or, use port-forwarding
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

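Access the UI via http://localhost:<port>, where <port> is the forwarded local port (3000 above) or, when browsing from a cluster node, the assigned NodePort. To find the NodePort assigned to the service: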

kubectl get svc kube-prometheus-stack-grafana -n monitoring

Note: Depending on your network setup, you may need to open an SSH tunnel to the machine with ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name> and then browse to localhost:<port> on your local machine.

You'll need to use the username admin and the grafanaAdminPassword you chose earlier to log in.

Grafana Dashboard

Once logged in, you can import additional dashboards to make the most of the node-exporter and kube-state-metrics deployments bundled with the chart. For detailed node and Kubernetes metrics, we recommend this dashboard.

Setting Up AIStore-Specific Alerts

To enhance your monitoring with AIStore-specific alerts, ensure you have configured your alert receivers in the kube_prometheus_stack_values.yaml file. Once done, you can easily add AIStore-specific alerts by applying the following configuration:

kubectl apply -n monitoring -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/kube_prometheus_stack_aistore_rules.yaml

This command will set up additional AIStore-specific alerting rules to help you monitor and manage your AIStore environment effectively.
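To check that the rules were created, the rule objects in the namespace can be listed. This sketch assumes the manifest defines Prometheus Operator `PrometheusRule` resources, which the file name suggests but the text does not state; the function name `list_alert_rules` is hypothetical.

```shell
# Hypothetical helper: list PrometheusRule objects in the monitoring namespace.
# Assumes the alerts manifest creates PrometheusRule resources.
list_alert_rules() {
    kubectl get prometheusrules --namespace "${1:-monitoring}"
}
```

The loaded rules should also show up under the Alerts tab of the Prometheus UI once the operator has reloaded the configuration.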