Monitoring the performance and health of AIStore is essential for maintaining the efficiency and reliability of the system. This guide provides detailed instructions on how to monitor AIStore using both command-line tools and a Kubernetes-based monitoring stack.
AIStore provides a CLI (command-line interface) with a `show performance` command. This command offers a snapshot of the cluster's performance, including throughput, latencies, disk IO, capacity, and more.
You can set up your own Kubernetes stack for monitoring. For a comprehensive monitoring setup, we recommend the kube-prometheus-stack Helm chart. This chart installs and integrates several components: the Prometheus Operator, Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics.

This setup forms a foundational monitoring stack that can be extended as needed.
Identify the nodes designated for monitoring:

```console
kubectl get nodes
```

Note: In larger deployments, label only the nodes allocated for monitoring.

Label these nodes accordingly:
```console
kubectl label node/node-01 'aistore.nvidia.com/role_monitoring=true'
...
```
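If several nodes need the label, a small loop saves repetition. The sketch below (with hypothetical node names) only prints the `kubectl label` commands so you can review them first, then pipe the output to `sh` to apply:

```shell
# Print one `kubectl label` command per monitoring node (hypothetical names);
# review the output, then pipe it to `sh` to actually apply the labels.
for node in node-01 node-02 node-03; do
  printf 'kubectl label node/%s aistore.nvidia.com/role_monitoring=true\n' "$node"
done
```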
Create a dedicated namespace to house all monitoring-related pods and services:

```console
kubectl create ns monitoring
```
Ensure `helm` is installed. If not, follow the installation guide.

- Install Helm using the provided script:

  ```console
  curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
  chmod 700 get_helm.sh
  ./get_helm.sh
  ```

- Add the prometheus-community repo to Helm:

  ```console
  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  ```

- Customize the chart values by editing `kube_prometheus_stack_values.yaml`. This involves setting `nodeAffinity`, `grafanaAdminPassword`, persistent stats storage (commented out), and `securityContext`.

- Customize the alert receivers in the `kube_prometheus_stack_values.yaml` file. Alertmanager supports various receivers, and you can configure them as needed. Refer to the Prometheus Alerting Configuration for details on each receiver's config. A sample Slack receiver configuration is included (commented out) in the file.

- To set `securityContext`, specify a non-root user (typically UID >= 1000). To identify existing non-root users, run:

  ```console
  awk -F: '$3 >= 1000 {print $1}' /etc/passwd
  ```

  Alternatively, use an existing non-root user or create a new one. To obtain a user's UID and group ID (GID), execute:

  ```console
  id [username]
  ```

  Then update the `kube_prometheus_stack_values.yaml` file with the user's UID and GID by setting the `runAsUser` and `runAsGroup` fields, respectively, under `securityContext`. Also, don't forget to set the `grafanaAdminPassword`.
Important: If your monitoring nodes are labeled differently, remember to adjust the `key` value in the `nodeAffinity` configuration within the same file to match your custom label. The default setting is `aistore.nvidia.com/role_monitoring=true`.
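For orientation, the sections of `kube_prometheus_stack_values.yaml` you will be editing look roughly like the sketch below. The field values are placeholders and the exact layout may differ between chart versions, so verify against the actual file in the ais-k8s repository:

```yaml
# Hedged sketch of the values to customize; placeholders, not real defaults.
grafana:
  adminPassword: "choose-a-strong-password"   # grafanaAdminPassword
  securityContext:
    runAsUser: 1000    # UID of a non-root user (see `id <username>`)
    runAsGroup: 1000   # GID of the same user
prometheus:
  prometheusSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: aistore.nvidia.com/role_monitoring   # your label key
                  operator: In
                  values: ["true"]
```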
Deploy the Prometheus stack with customized values:

```console
helm install -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/kube_prometheus_stack_values.yaml kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring
```
At this point, you'll have a Prometheus instance running that mostly just monitors itself. To monitor AIStore, we'll need to add a couple of `PodMonitor` definitions.
You can find two `PodMonitor` definitions in the file `ais_podmonitors.yaml`. Apply them:

```console
kubectl -n monitoring apply -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/ais_podmonitors.yaml
```
When applied, the monitors configure Prometheus to scrape metrics from AIStore's proxy and target pods individually every 30 seconds.
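For reference, each of the two `PodMonitor` resources has roughly the shape sketched below. The selector labels and port name here are illustrative assumptions; the authoritative definitions live in `ais_podmonitors.yaml`:

```yaml
# Illustrative PodMonitor sketch; see ais_podmonitors.yaml for the real thing.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ais-target
  namespace: monitoring
spec:
  namespaceSelector:
    any: true                  # scrape matching pods in any namespace
  selector:
    matchLabels:
      component: target        # assumed pod label; check your AIS pod labels
  podMetricsEndpoints:
    - port: metrics            # assumed metrics port name
      interval: 30s            # matches the 30-second scrape interval above
```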
The Prometheus UI is not directly accessible from outside the cluster. Options include changing the service type to `NodePort` or using port-forwarding:

```console
# change `type: ClusterIP` to `type: NodePort`
kubectl patch svc kube-prometheus-stack-prometheus -n monitoring -p '{"spec": {"type": "NodePort"}}'

# or use port-forwarding:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090
```

Access the UI via `http://localhost:<port>`. Find the NodePort/port assigned to the service:

```console
kubectl get svc kube-prometheus-stack-prometheus -n monitoring
```
Note: Depending on your network configuration, you might need to tunnel the port over SSH (`ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name>`) and then view `localhost:<port>` locally.
`kube-prometheus-stack`'s Grafana deployment makes use of the kiwigrid k8s-sidecar image, which allows us to provide our own dashboards as Kubernetes ConfigMaps.

A sample dashboard can be found at `aistore_dashboard.yaml`. Apply it:

```console
kubectl apply -n monitoring -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/aistore_dashboard.yaml
```
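The sidecar watches for ConfigMaps carrying a specific label (by default `grafana_dashboard`) and loads the embedded dashboard JSON automatically. A dashboard ConfigMap therefore looks roughly like this (the name and JSON content are placeholders):

```yaml
# Illustrative dashboard ConfigMap; the sidecar picks it up via the label.
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-aistore-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"     # default label the kiwigrid sidecar watches for
data:
  my-aistore-dashboard.json: |
    { "title": "My AIStore Dashboard", "panels": [] }
```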
Similar to Prometheus, to access the Grafana dashboard you will have to either change the service type to `NodePort` or use port-forwarding:

```console
# change `type: ClusterIP` to `type: NodePort`
kubectl patch svc kube-prometheus-stack-grafana -n monitoring -p '{"spec": {"type": "NodePort"}}'

# or, use port-forwarding
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
```

Access the UI via `http://localhost:<port>`. Find the NodePort/port assigned to the service:

```console
kubectl get svc kube-prometheus-stack-grafana -n monitoring
```
Note: Depending on your network configuration, you might need to tunnel the port over SSH (`ssh -L <port>:localhost:<port> <user-name>@<ip-or-host-name>`) and then view `localhost:<port>` locally.
You'll need to use the username `admin` and the `grafanaAdminPassword` you chose earlier to log in.
Once logged in, you can import more dashboards to make the most of the `node-exporter` and `kube-state-metrics` deployments bundled with the chart. For detailed node and Kubernetes metrics, we recommend this dashboard.
To enhance your monitoring with AIStore-specific alerts, ensure you have configured your alert receivers in the `kube_prometheus_stack_values.yaml` file. Once done, you can easily add AIStore-specific alerts by applying the following configuration:

```console
kubectl apply -n monitoring -f https://raw.githubusercontent.com/NVIDIA/ais-k8s/main/manifests/monitoring/kube_prometheus_stack_aistore_rules.yaml
```
This command will set up additional AIStore-specific alerting rules to help you monitor and manage your AIStore environment effectively.
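If you later want to add rules of your own, they follow the standard `PrometheusRule` shape. The sketch below is a hypothetical example, not one of the bundled rules; the alert name and expression are illustrative, and the `release` label must match whatever your Prometheus instance's rule selector expects:

```yaml
# Hypothetical custom alerting rule; not part of the bundled rules file.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-aistore-extra-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match the chart's ruleSelector
spec:
  groups:
    - name: aistore.extra
      rules:
        - alert: AISPodDown
          expr: up == 0              # `up` is the standard scrape-health metric
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "A scraped pod has been down for 5 minutes"
```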