Commit

Initial commit for DCGM Exporter repository
glowkey committed Aug 3, 2021
1 parent 5b9f6d1 commit f452488
Showing 121 changed files with 47 additions and 38,387 deletions.
4 changes: 2 additions & 2 deletions .github/PR_TEMPLATE.md
@@ -1,9 +1,9 @@
**Please open your pull requests on [gitlab repository](https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools.git) **
**Please open your pull requests on [gitlab repository](https://gitlab.com/nvidia/dcgm-exporter.git) **

_Make sure to complete the following items:_

- _A reference to a related issue._
- _A small description of the changes proposed in the pull request._
- _One commit per change and descriptive commit messages._
- _Sign-off your work following these [guidelines](https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools/blob/master/CONTRIBUTING.md) ._
- _Sign-off your work following these [guidelines](https://gitlab.com/nvidia/dcgm-exporter/blob/master/CONTRIBUTING.md) ._
- _Test run of your changes._
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -41,7 +41,7 @@ e2e:
- ssh -i aws-kube-ci/key ${instance_hostname} \
"export CI_COMMIT_SHORT_SHA=${CI_COMMIT_SHORT_SHA} &&
export CI_REGISTRY_IMAGE=${CI_REGISTRY_IMAGE} &&
cd ~/gpu-monitoring-tools && sudo -E ./tests/ci-run-e2e.sh"
cd ~/dcgm-exporter && sudo -E ./tests/ci-run-e2e.sh"

aws_kube_clean:
extends: .aws_kube_clean
3 changes: 0 additions & 3 deletions .gitmodules
@@ -1,3 +0,0 @@
[submodule "aws-kube-ci"]
path = aws-kube-ci
url = https://gitlab.com/nvidia/container-infrastructure/aws-kube-ci.git
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Contribute to the GPU Operator Project
# Contribute to the DCGM-Exporter Project

Want to hack on the NVIDIA Container Toolkit Project? Awesome!
Want to hack on the NVIDIA DCGM-Exporter Project? Awesome!
We only require you to sign your work, the below section describes this!

## Sign your work
23 changes: 1 addition & 22 deletions Makefile
@@ -25,7 +25,7 @@ NON_TEST_FILES := pkg/dcgm.go pkg/gpu_collector.go pkg/parser.go pkg/pipeline.g
MAIN_TEST_FILES := pkg/system_info_test.go

.PHONY: all binary install check-format
all: ubuntu18.04 ubuntu20.04 ubi8
all: ubuntu20.04 ubi8

binary:
cd pkg; go build
@@ -43,40 +43,19 @@ check-format:

push:
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu20.04"
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu18.04"
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubi8"

push-short:
$(DOCKER) tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu18.04" "$(REGISTRY)/dcgm-exporter:$(DCGM_VERSION)"
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:$(DCGM_VERSION)"

push-ci:
$(DOCKER) tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu18.04" "$(REGISTRY)/dcgm-exporter:$(VERSION)"
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:$(VERSION)"

push-latest:
$(DOCKER) tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu18.04" "$(REGISTRY)/dcgm-exporter:latest"
$(DOCKER) push "$(REGISTRY)/dcgm-exporter:latest"

ubuntu20.04:
$(DOCKER) build --pull \
--build-arg "GOLANG_VERSION=$(GOLANG_VERSION)" \
--build-arg "DCGM_VERSION=$(DCGM_VERSION)" \
--tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu20.04" \
--file docker/Dockerfile.ubuntu20.04 .

ubuntu18.04:
$(DOCKER) build --pull \
--build-arg "GOLANG_VERSION=$(GOLANG_VERSION)" \
--build-arg "DCGM_VERSION=$(DCGM_VERSION)" \
--tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubuntu18.04" \
--file docker/Dockerfile.ubuntu18.04 .

ubi8:
$(DOCKER) build --pull \
--build-arg "GOLANG_VERSION=$(GOLANG_VERSION)" \
--build-arg "DCGM_VERSION=$(DCGM_VERSION)" \
--build-arg "VERSION=$(FULL_VERSION)" \
--tag "$(REGISTRY)/dcgm-exporter:$(FULL_VERSION)-ubi8" \
--file docker/Dockerfile.ubi8 .

34 changes: 9 additions & 25 deletions README.md
@@ -1,28 +1,12 @@
# NVIDIA GPU Monitoring Tools
# DCGM-Exporter

This repository contains Golang bindings and DCGM-Exporter for gathering GPU telemetry in Kubernetes.

**July 2021 - Update #1: The DCGM Go bindings have moved to github.com/NVIDIA/go-dcgm. The DCGM bindings in this repo are no longer maintained and will eventually be removed.

**June 2021 - NOTICE: Some of the tools in this repository are graduating to their own repos. In the next few weeks both the DCGM Go bindings and the DCGM Exporter will be migrating to github.com/NVIDIA. This will allow for independent versioning, issues, MRs, etc. Efforts will be made to review the existing MRs and issues before the migration occurs.**

## Bindings

Golang bindings are provided for the following two libraries:
- [NVIDIA Management Library (NVML)](https://docs.nvidia.com/deploy/nvml-api/nvml-api-reference.html#nvml-api-reference) is a C-based API for monitoring and managing NVIDIA GPU devices.
- [NVIDIA Data Center GPU Manager (DCGM)](https://developer.nvidia.com/dcgm) is a set of tools for managing and monitoring NVIDIA GPUs in cluster environments. It's a low overhead tool suite that performs a variety of functions on each host system including active health monitoring, diagnostics, system validation, policies, power and clock management, group configuration and accounting.

You will also find samples for both of these bindings in this repository.

## DCGM-Exporter

The repository also contains DCGM-Exporter. It exposes GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA DCGM](https://developer.nvidia.com/dcgm).
The repository contains the DCGM-Exporter project. It exposes a GPU metrics exporter for [Prometheus](https://prometheus.io/) leveraging [NVIDIA DCGM](https://developer.nvidia.com/dcgm).

### Quickstart

To gather metrics on a GPU node, simply start the `dcgm-exporter` container:
```
$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
$ docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04
$ curl localhost:9400/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
@@ -46,7 +30,7 @@ Ensure you have already setup your cluster with the [default runtime as NVIDIA](
The recommended way to install DCGM-Exporter is to use the Helm chart:
```
$ helm repo add gpu-helm-charts \
https://nvidia.github.io/gpu-monitoring-tools/helm-charts
https://nvidia.github.io/dcgm-exporter/helm-charts
```
Update the repo:
```
@@ -63,7 +47,7 @@ Once the `dcgm-exporter` pod is deployed, you can use port forwarding to obtain


```
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/gpu-monitoring-tools/master/dcgm-exporter.yaml
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
# Let's get the output of a random pod:
$ NAME=$(kubectl get pods -l "app.kubernetes.io/name=dcgm-exporter" \
@@ -95,8 +79,8 @@ Ensure you have the following:
- [DCGM installed](https://developer.nvidia.com/dcgm)

```
$ git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
$ cd gpu-monitoring-tools
$ git clone https://github.com/NVIDIA/dcgm-exporter.git
$ cd dcgm-exporter
$ make binary
$ sudo make install
...
@@ -153,5 +137,5 @@ Pull requests are accepted!

[Checkout the Contributing document!](CONTRIBUTING.md)

* Please let us know by [filing a new issue](https://github.com/NVIDIA/gpu-monitoring-tools/issues/new)
* You can contribute by opening a [pull request](https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools)
* Please let us know by [filing a new issue](https://github.com/NVIDIA/dcgm-exporter/issues/new)
* You can contribute by opening a [pull request](https://github.com/NVIDIA/dcgm-exporter)
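
Reviewer note: the Quickstart hunk above shows `curl localhost:9400/metrics` returning Prometheus text exposition output. To sanity-check an endpoint after the rename, that output could be read with a minimal Python sketch like the one below. The sample line, its `gpu` label, and its value are hypothetical; only the `# HELP` and `# TYPE` lines for `DCGM_FI_DEV_SM_CLOCK` appear in the diff. `parse_metrics` is an illustrative name, not part of DCGM-Exporter.

```python
import re

# One sample line: metric name, optional {labels}, a value, optional timestamp.
_SAMPLE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces, e.g. gpu="0".
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# HELP/TYPE lines are taken from the README; the sample line is hypothetical.
SAMPLE_TEXT = """\
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0"} 139
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE_TEXT):
        print(name, labels, value)
```

The sketch ignores timestamps and does not unescape label values. For anything beyond a quick check, a maintained Prometheus client library would be a safer choice.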
1 change: 0 additions & 1 deletion RELEASE.md
@@ -1,7 +1,6 @@
# Release

This document describes the release process as well as the versioning strategy for the DCGM exporter.
In the future this document will also contain information about the go bindings.

## Versioning

1 change: 0 additions & 1 deletion aws-kube-ci
Submodule aws-kube-ci deleted from 49dd87
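
Reviewer note: this commit is largely a rename from `gpu-monitoring-tools` to `dcgm-exporter`, so it may be worth scanning the resulting tree for leftover references. The sketch below is one way to do that and is not part of the repository. `find_stale_refs` is a hypothetical helper, and the sample contents are excerpted from added lines in this diff rather than read from a checkout. For instance, the updated README Quickstart still uses an `-ubuntu18.04` image tag although `ubuntu18.04` is dropped from the Makefile `all` target, which may or may not be intentional.

```python
def find_stale_refs(files, old_names):
    """Return (path, line_number, old_name) for every leftover reference.

    `files` maps a path to its text content; `old_names` lists substrings
    that should no longer appear after the rename.
    """
    hits = []
    for path in sorted(files):
        for number, line in enumerate(files[path].splitlines(), start=1):
            for old in old_names:
                if old in line:
                    hits.append((path, number, old))
    return hits


# Excerpts of added lines from this diff, used here as illustrative input.
SAMPLE_FILES = {
    "README.md": (
        "$ git clone https://github.com/NVIDIA/dcgm-exporter.git\n"
        "$ docker run -d --gpus all --rm -p 9400:9400 "
        "nvcr.io/nvidia/k8s/dcgm-exporter:2.2.9-2.4.0-ubuntu18.04\n"
    ),
    ".gitlab-ci.yml": "cd ~/dcgm-exporter && sudo -E ./tests/ci-run-e2e.sh\n",
}

if __name__ == "__main__":
    old_names = ["gpu-monitoring-tools", "ubuntu18.04"]
    for hit in find_stale_refs(SAMPLE_FILES, old_names):
        print(hit)
```

In practice the `files` mapping would be built by reading the checked-out tree. A plain substring match is deliberately crude and will also flag legitimate mentions, so each hit still needs a human look.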
