# Cheat sheets and checklists

- Student name: Bart Peelman
- GitHub repo: https://github.com/BartPeelman/REPO

## Basic commands

| Task              | Command           |
| :---------------- | :---------------- |
| Change directory  | `cd DIRECTORY`    |
| List files        | `ls -l`           |
| Create directory  | `mkdir DIRECTORY` |
| Create empty file | `touch FILE`      |
| Copy file         | `cp FILE DEST`    |
| Move file         | `mv FILE DEST`    |

## Docker commands

| Task                | Command                 |
| :------------------ | :---------------------- |
| List all containers | `docker ps -a`          |
| List all images     | `docker images`         |
| Stop a container    | `docker stop CONTAINER` |
| Remove a container  | `docker rm CONTAINER`   |

## Git workflow

Simple workflow for a personal project without other contributors:

| Task                                         | Command                   |
| :------------------------------------------- | :------------------------ |
| Current project status                       | `git status`              |
| Select files to be committed                 | `git add FILE...`         |
| Commit changes to local repository           | `git commit -m 'MESSAGE'` |
| Push local changes to remote repository      | `git push`                |
| Pull changes from remote repository to local | `git pull`                |

## Checklist network configuration

1. Is the IP address correct? `ip a`
2. Is the router/default gateway correct? `ip r`
3. Is a DNS server available? `cat /etc/resolv.conf`
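
The three checks above can be wrapped in a small helper. This is only a sketch: the function names `count_nameservers` and `network_checklist` are made up for this cheat sheet, and it assumes the `ip` tool from iproute2 is installed.

```shell
# Count the "nameserver" entries in a resolv.conf-style file
# (defaults to /etc/resolv.conf). Prints 0 if the file is unreadable.
count_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    [ -r "$file" ] || { echo 0; return 0; }
    # grep -c prints the match count; it exits 1 on zero matches, hence "|| true"
    grep -c '^[[:space:]]*nameserver[[:space:]]' "$file" || true
}

# Run the checklist in order: IP address, default gateway, DNS.
network_checklist() {
    echo "1. IP address:";      ip -brief address
    echo "2. Default gateway:"; ip route show default
    echo "3. DNS: $(count_nameservers) nameserver(s) in /etc/resolv.conf"
}
```

Call `network_checklist` on the VM; an empty answer at step 2 or a count of `0` at step 3 points at the gateway or DNS configuration respectively.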

## Docker (Lab 1 specific)

| Task                                  | Command / Note                                                                 |
| :----------------------------------- | :---------------------------------------------------------------------------- |
| Build Docker image                     | `docker build -t IMAGE_NAME .`                                                |
| Run container from image               | `docker run -p HOST_PORT:CONTAINER_PORT IMAGE_NAME`                            |
| Access container logs                  | `docker logs CONTAINER`                                                       |
| Remove image                           | `docker rmi IMAGE_NAME`                                                       |
| Push image to registry (Docker Hub)   | `docker push USERNAME/IMAGE_NAME:TAG`                                         |
| Pull image from registry               | `docker pull USERNAME/IMAGE_NAME:TAG`                                         |
| Test ML inference endpoint             | `curl -X POST -F "file=@IMAGE_FILE" http://localhost:5000/predict`           |
| List running containers                | `docker ps`                                                                   |

## Triton Inference Server

| Task                                  | Command / Note                                                                 |
| :----------------------------------- | :---------------------------------------------------------------------------- |
| Run Triton server with TensorFlow model | `docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -v MODEL_REPO:/models nvcr.io/nvidia/tritonserver:23.09-py3 tritonserver --model-repository=/models` |
| Triton HTTP endpoint test              | `curl -X POST -H "Content-Type: application/octet-stream" --data-binary @INPUT_FILE http://localhost:8000/v2/models/MODEL_NAME/versions/1/infer` |
| Model repository structure             | `/models/MODEL_NAME/1/model.savedmodel`                                        |
| Config file for model                  | `config.pbtxt` (defines input/output names, shapes, types)                    |
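
A possible way to scaffold the repository layout from the table is sketched below. The helper name `make_model_repo` is hypothetical, and `INPUT_NAME`, `OUTPUT_NAME`, the data types, and the `dims` values are placeholders that must be replaced with the values of the actual model.

```shell
# Create REPO/MODEL/VERSION/model.savedmodel plus a skeleton config.pbtxt.
# Usage: make_model_repo REPO MODEL [VERSION]   (VERSION defaults to 1)
make_model_repo() {
    local repo="$1" model="$2" version="${3:-1}"
    mkdir -p "$repo/$model/$version/model.savedmodel"
    # Keep an existing config untouched
    [ -e "$repo/$model/config.pbtxt" ] && return 0
    cat > "$repo/$model/config.pbtxt" <<EOF
name: "$model"
platform: "tensorflow_savedmodel"
max_batch_size: 0
input [
  {
    name: "INPUT_NAME"
    data_type: TYPE_FP32
    dims: [ 1, 224, 224, 3 ]
  }
]
output [
  {
    name: "OUTPUT_NAME"
    data_type: TYPE_FP32
    dims: [ 1, 1 ]
  }
]
EOF
}
```

After `make_model_repo ./models MODEL_NAME`, copy the exported SavedModel (`saved_model.pb` and the `variables/` folder) into `./models/MODEL_NAME/1/model.savedmodel/` and mount `./models` as `MODEL_REPO`.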

## Docker Compose

| Task                                   | Command / Note                                                                |
| :------------------------------------ | :---------------------------------------------------------------------------- |
| Start all services                      | `docker compose up -d`                                                        |
| Stop all services                       | `docker compose down`                                                         |
| View logs of all services               | `docker compose logs -f`                                                      |
| Build/rebuild all services              | `docker compose build`                                                        |
| Scale a service                          | `docker compose up -d --scale SERVICE=NUM`                                     |

# Monitoring & Alerting (Prometheus / Grafana / Alertmanager)

## Prometheus

| Task                              | Command / Note                              |
|-----------------------------------|--------------------------------------------|
| Access Prometheus UI              | `http://localhost:9090`                    |
| Check scrape targets              | Status → Targets                           |
| Query a metric                    | Graph → `model_result`                     |
| Reload config & rules             | `docker kill -s HUP prometheus`            |
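
The same checks can be done from a terminal through the Prometheus HTTP API. The sketch below assumes Prometheus is reachable at `http://localhost:9090` and that `curl` is installed; the function names are made up for this cheat sheet.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Build the instant-query URL for a simple expression such as a bare metric
# name. Expressions with spaces or special characters need URL encoding,
# which prom_query below leaves to curl.
prom_query_url() {
    printf '%s/api/v1/query?query=%s\n' "$PROM_URL" "$1"
}

# Check that Prometheus is up, then run an instant query (JSON on stdout).
prom_query() {
    curl -fsS "$PROM_URL/-/healthy" >/dev/null \
        || { echo "Prometheus not reachable at $PROM_URL" >&2; return 1; }
    curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# List scrape targets as JSON (same data as Status → Targets).
prom_targets() {
    curl -fsS "$PROM_URL/api/v1/targets"
}
```

For example, `prom_query model_result` returns the current value of the metric, and `prom_query_url model_result` prints a URL that can be pasted into a browser.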

---

## Grafana

| Task                              | Command / Note                              |
|-----------------------------------|--------------------------------------------|
| Access Grafana UI                 | `http://localhost:3000`                    |
| Default login                     | `admin / admin`                            |
| Prometheus datasource URL         | `http://prometheus:9090`                   |
| Dashboard refresh interval        | `5s`                                       |
| Dashboard time range              | Last `15 minutes`                          |
| Threshold line                    | Red line at `y = 0.75`                     |
| Import dashboard                  | Dashboards → Import                        |

---

## Alertmanager

| Task                              | Command / Note                              |
|-----------------------------------|--------------------------------------------|
| Access Alertmanager UI            | `http://localhost:9093`                    |
| Alert states                      | Inactive / Pending / Firing / Resolved     |
| Discord notifications             | Via webhook in `alertmanager.yml`          |
| Resolved notifications            | On by default for webhook/Discord receivers (do NOT set `send_resolved: false`) |
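
As an illustration, a minimal `alertmanager.yml` with a Discord receiver could be generated as follows. This assumes Alertmanager 0.25 or newer, which ships a native `discord_configs` receiver; the lab setup may use a different receiver type. The helper name, the receiver name `discord`, and the `group_by` choice are assumptions, and the webhook URL is a placeholder.

```shell
# Write a minimal alertmanager.yml with a single Discord receiver.
# Usage: write_alertmanager_config OUTPUT_FILE WEBHOOK_URL
write_alertmanager_config() {
    local out="$1" webhook="$2"
    cat > "$out" <<EOF
route:
  receiver: discord
  group_by: ['alertname']
receivers:
  - name: discord
    discord_configs:
      - webhook_url: '$webhook'
        send_resolved: true   # already the default; shown to make it explicit
EOF
}
```

Usage: `write_alertmanager_config alertmanager.yml 'https://discord.com/api/webhooks/ID/TOKEN'`, with `ID/TOKEN` replaced by the values from the Discord channel settings.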

---

## Node Exporter (AlmaLinux VM)

| Task                              | Command                                   |
|-----------------------------------|-------------------------------------------|
| Check Node Exporter status         | `systemctl status node_exporter`          |
| Enable & start service            | `sudo systemctl enable --now node_exporter` |
| Check metrics on VM               | `curl localhost:9100/metrics`             |
| Check metrics from host           | `http://<VM-IP>:9100/metrics`             |
| Check listening port              | `ss -tulnp \| grep 9100`                  |
| Check VM IP address               | `ip a`                                    |
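
To confirm that the exporter actually serves CPU metrics, and not just that the port is open, the output can be filtered. A minimal sketch, assuming `curl` is available; the helper names are hypothetical, while `node_cpu_seconds_total` is a standard Node Exporter metric.

```shell
# Count samples of one metric in Prometheus text-format output read from stdin.
# "# HELP" and "# TYPE" lines are skipped because they start with '#'.
count_samples() {
    grep -c "^$1[{ ]" || true
}

# Fetch /metrics from a Node Exporter and report how many CPU samples it exposes.
# Usage: check_node_exporter [HOST]   (HOST defaults to localhost)
check_node_exporter() {
    local host="${1:-localhost}"
    curl -fsS "http://$host:9100/metrics" | count_samples node_cpu_seconds_total
}
```

Run `check_node_exporter` on the VM and `check_node_exporter <VM-IP>` on the host; a count of `0` from the host only usually means the firewall port is still closed.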

---

## Firewall (VM)

| Task                              | Command                                   |
|-----------------------------------|-------------------------------------------|
| Open Node Exporter port            | `sudo firewall-cmd --add-port=9100/tcp --permanent` |
| Reload firewall                   | `sudo firewall-cmd --reload`              |

---

## Stress testing & monitoring (VM)

| Task                              | Command                                   |
|-----------------------------------|-------------------------------------------|
| Install stress-ng                 | `sudo dnf install stress-ng`              |
| CPU stress (2 min)                | `stress-ng --cpu 1 --timeout 2m`          |
| CPU stress (5 min)                | `stress-ng --cpu 1 --timeout 5m`          |
| Adjust CPU cores                  | Replace `1` with VM core count            |
| Basic monitoring                  | `top`                                     |
| Visual monitoring                 | `htop`                                    |
| Advanced visuals                  | `btop`                                    |

---

## Alerting concepts (theory)

| Concept        | Meaning                                                      |
|----------------|--------------------------------------------------------------|
| `for:`         | Condition must hold for duration before alert fires          |
| Grouping       | Combine similar alerts into one notification                 |
| Inhibition     | Suppress alerts if another related alert is firing           |
| Silencing      | Temporarily mute alerts manually                             |
| Resolve alert  | Notification sent when alert condition returns to normal     |
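
The effect of `for:` is easiest to see in a concrete rule. The sketch below writes a Prometheus rule file that uses the `model_result` metric and the `0.75` threshold from the Grafana section; the group name, alert name, label, and default duration of `1m` are assumptions chosen for illustration.

```shell
# Write a Prometheus rule file with one alert on model_result.
# Usage: write_alert_rule OUTPUT_FILE [DURATION]   (DURATION defaults to 1m)
write_alert_rule() {
    local out="$1" duration="${2:-1m}"
    cat > "$out" <<EOF
groups:
  - name: model_alerts
    rules:
      - alert: ModelResultHigh
        expr: model_result > 0.75
        for: $duration
        labels:
          severity: warning
        annotations:
          summary: "model_result has been above 0.75 for $duration"
EOF
}
```

With this rule, the alert is Pending as soon as `model_result` exceeds `0.75` and only becomes Firing once the condition has held for the whole duration. If `promtool` is installed, `promtool check rules FILE` validates the generated file.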




## Notes / Tips for myself

- When building Docker images, make sure your `Dockerfile` is in the same directory as your code or set the context correctly.
- Always check container logs to debug why a service isn’t starting (`docker logs CONTAINER`).
- In the Triton model repository, each model gets its own folder with one numbered subfolder per version (e.g. `MODEL_NAME/1/`); `config.pbtxt` sits in the model folder and must match the model’s input/output names, shapes, and types.
- Docker Compose simplifies running multiple services (e.g., Triton + Flask app) at once.
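- Inside a container, `localhost` is the container itself, so services running on the host are not reachable that way.
- To reach the host, use `host.docker.internal` (built in on Docker Desktop; on Linux add `extra_hosts: ["host.docker.internal:host-gateway"]`) or `network_mode: host`.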
- Always check **Prometheus → Targets** first when metrics are missing.
- Alerts will not fire immediately if `for:` is configured.
- Node Exporter working when started manually does not mean the systemd service works; verify with `systemctl status node_exporter`.
- Firewall rules on the VM are required for access from the host.

