# Prometheus

## Introduction
> Prometheus is an open-source monitoring and alerting toolkit for gathering and processing data locally. 

It was originally developed at SoundCloud and is currently a fully open-source, community-driven project. Prometheus is designed to collect and store metrics as time-series data. It stores metric information with a timestamp (when it was recorded), alongside key-value pairs called labels.

To obtain a relatively broad overview of Prometheus and its uses, visit the main page [here](https://prometheus.io/docs/introduction/overview/).

*Facts to note about Prometheus*
- It is written in golang (Go), a programming language developed at Google.
- It provides client libraries for several languages, including Python.
- Prometheus' query language, PromQL, can be employed to run queries.
- It hosts a dashboard in web browsers on `localhost:9090` by default.

## Main Components

* **Prometheus server:** for scraping and storing time-series data.
* **Client libraries:** for instrumenting application code.
* **Push gateway:** lets short-lived jobs push their metrics so that Prometheus can scrape them.
* **Alertmanager:** for handling alerts.
* **Exporters:** support scraping metrics for third-party services, such as HAProxy, StatsD and Graphite.

![](images/prometheus_architecture.png)

## Process-Flow Overview

As shown above,
- Prometheus scrapes data (e.g. metrics)
    - from short-lived jobs via `push gateway`.
    - from long-running jobs directly.
- __All samples (values with timestamps) are stored locally,__ together with the necessary metadata.
- Prometheus runs predefined rules on collected data:
    - It gathers and aggregates new records.
    - It processes them and sends alerts.
- Consumers use Prometheus API to visualise data.

## Applications of Prometheus

- __Recording numerical time series__, including
    - various training metrics gathered across epochs/batches.
    - hardware-related data.
    - network traffic and other statistics.
- __Debugging infrastructure during network outages for example__ (if interested, you can read about Slack's outage [here](https://slack.engineering/slacks-outage-on-january-4th-2021/)).

> __Data is stored locally, independent of the network storage.__

Consequently, if a failure occurs, you always have access to the data, as each Prometheus server is self-contained and independent.

__Do not use Prometheus if you will be working with highly detailed data coming in fast__ (per-request metrics, with many requests from users, and exact billing when every millisecond counts).

This is because
- Prometheus is designed to scrape data every few seconds.
- the data are stored in the local storage, which may be insufficient (in this case, consider employing large, remote storage and data rotation).

## Data

> __All Prometheus data are stored as timestamped time series, differentiated by the metric and label (optional).__

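- Metric names and label names should consist of ASCII letters, digits and underscores (metric names may also contain colons) and must not start with a digit.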
- __Samples are `float64` (`double`) numerical types.__
- __Timestamps are in milliseconds.__

For instance, `process_cpu_seconds_total`, which records the total CPU time (in seconds) used by our Prometheus instance, will be displayed as

`process_cpu_seconds_total{instance="localhost:9090", job="prometheus"}`.

*Breakdown*
- `process_cpu_seconds_total` represents the metric name.
- `instance="localhost:9090"` is the label that informs us of the instance for which the CPU time is being checked.
- `job="prometheus"` is another label that informs us of the job for which the CPU time is being checked.
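The breakdown above can be reproduced in code. The following is a minimal sketch, assuming the simple `name{label="value", ...}` notation shown in this section; `parse_series` is a hypothetical helper written for illustration and is not part of any Prometheus library.

```python
import re
from typing import Dict, Tuple

# A metric name, optionally followed by `{label="value", ...}`.
SERIES_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_series(identifier: str) -> Tuple[str, Dict[str, str]]:
    """Split a series identifier into (metric name, labels)."""
    match = SERIES_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a valid series identifier: {identifier!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels


name, labels = parse_series(
    'process_cpu_seconds_total{instance="localhost:9090", job="prometheus"}'
)
print(name)    # process_cpu_seconds_total
print(labels)  # {'instance': 'localhost:9090', 'job': 'prometheus'}
```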

## Metrics

> Prometheus client libraries provide `4` core metric types.

_Note: The Prometheus server does not differentiate between metric types; it stores everything as untyped time series. The types are used by the client libraries (official libraries exist for languages including `golang`, `java`, __`python`__ and `ruby`)._

### Counter

> Provides a monotonically increasing value. It can only be reset to zero, e.g. when the process restarts.

It is useful for
- determining the number of requests,
- determining the number of tasks completed,
- anything that can only increase or start anew.
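To illustrate why "increase or start anew" matters, here is a rough sketch of how a consumer might compute the total increase of a counter across a restart. It is a simplification written for this lesson (the function name and sample values are hypothetical), not the exact algorithm that Prometheus uses.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase of a counter across consecutive samples.

    A drop between two samples is treated as a reset to zero, so the
    new value itself counts as the increase since the restart.
    """
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:  # the counter restarted and began counting from zero again
            total += current
    return total


# Hypothetical request counts scraped before and after a process restart.
print(counter_increase([10, 15, 21, 3, 8]))  # 19.0
```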

### Gauge

> Provides a single value that can increase or decrease.

It is useful for
- memory-usage monitoring,
- temperature monitoring,
- anything whose value can change arbitrarily.

### Histogram

> Samples observations and groups them in configurable buckets.

For demonstration, we name our histogram metric `our_super_histogram`. In this case, the following operations on the histogram are available (notice that we added suffixes to the metric name):
- `our_super_histogram_bucket{le="<upper inclusive bound>"}`: cumulative counters for the observation buckets.
- `our_super_histogram_sum`: sum of all values.
- `our_super_histogram_count`: count of observations.
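The three kinds of series listed above can be sketched in plain Python. This is only an illustration of the cumulative-bucket idea, assuming hypothetical bucket bounds and observations; real client libraries maintain these values incrementally.

```python
import math
from typing import Dict, Iterable, Sequence


def histogram_series(observations: Iterable[float], bounds: Sequence[float]) -> Dict[str, float]:
    """Return the series a histogram named `our_super_histogram` would expose."""
    values = list(observations)
    series = {}
    for bound in list(bounds) + [math.inf]:
        label = "+Inf" if math.isinf(bound) else str(bound)
        # Buckets are cumulative: each one counts every observation <= its bound.
        series[f'our_super_histogram_bucket{{le="{label}"}}'] = sum(v <= bound for v in values)
    series["our_super_histogram_sum"] = sum(values)
    series["our_super_histogram_count"] = len(values)
    return series


# Hypothetical request durations in seconds.
series = histogram_series([0.05, 0.2, 0.3, 1.4], bounds=[0.1, 0.5, 1.0])
for key, value in series.items():
    print(key, value)
```

Note that the `le="+Inf"` bucket always equals the `_count` series, since every observation falls below infinity.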

### Summary

> Samples observations and calculates configurable quantiles over a sliding time window.

- Provides `sum` and `count`, as with the histogram (`our_super_summary_sum` and `our_super_summary_count`).
- `our_super_summary{quantile="<φ>"}`: quantile of the observations (for a histogram metric, quantiles can instead be calculated at query time using functions).
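To give an intuition for how a quantile can be derived from histogram buckets at query time, the sketch below interpolates linearly inside the bucket that contains the requested rank. It is a simplified illustration under stated assumptions (finite bounds, lowest bucket starting at 0, evenly spread observations), not the implementation of PromQL's `histogram_quantile()` function.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, int]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Assumes the buckets are sorted by upper bound, the lowest bucket starts
    at 0, and observations are spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            # Position of the rank inside this bucket, between 0 and 1.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts for four observations.
buckets = [(0.1, 1), (0.5, 3), (1.0, 3), (2.0, 4)]
print(estimate_quantile(0.5, buckets))  # roughly 0.3
```

Because the result is interpolated, its accuracy depends on how well the bucket bounds match the data, which is the trade-off between histograms and summaries.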

### Functions

Functions can be run in queries for things such as `day_of_month()`. We will cover functions in more detail later.

## Jobs and Instances

- An instance is an endpoint, from which data can be scraped, e.g. an EC2 instance or a Docker container.
- A job is a collection of instances with the same purpose, e.g. multiple EC2 instances or Docker containers running replicas of the same service.

For more information regarding jobs and instances, consult the Prometheus documentation [here](https://prometheus.io/docs/concepts/jobs_instances/).

## First Contact

### Installation

> Prometheus was written in `golang`; hence, __it is contained in a single compiled executable.__

Due to the above, its installation and deployment are simple and can be performed efficiently via many different approaches.

- Go to [the download page](https://prometheus.io/download/), and download the specific version of Prometheus for your OS. (Mac users should download the file for the OS labelled `darwin`.)
- Alternatively, run Prometheus inside a Docker container.

We will use the last option. Since we are running Prometheus from a Docker container, we need to mount the Prometheus config file into the container. This will allow us to edit the config file on our machine and have the container pick up the changes. First, pull the Prometheus Docker image from Docker Hub using `docker pull prom/prometheus`. Thereafter, create a `prometheus.yml` file with the following simple configuration:

In [None]:
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alert when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it is Prometheus itself.
scrape_configs:
  # The job name added as a label `job=<job_name>` to any timeseries scraped
  - job_name: 'prometheus'
    # Override the global default and scrape targets from the job every 5 seconds.
    scrape_interval: '5s'
    static_configs:
      - targets: ['localhost:9090']


Subsequently, we can start the container using the command below. However, you have to change `/path/to/prometheus.yml` to the path at which your `prometheus.yml` config file is stored locally (watch out for file paths with spaces).

Additionally, we need to add the `--web.enable-lifecycle` flag, as it enables reloading the config via an HTTP request.

Run the command below in your CLI.

In [None]:
sudo docker run --rm -d \
    --network=host \
    --name prometheus \
    -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.enable-lifecycle 

### Docker options
- `-d`: runs the container in detached mode, i.e. in the background, thereby freeing up the terminal.
- `--rm`: removes the container when it stops.
- `--name`: specifies the name of the container.
- `-v`: mounts your local `prometheus.yml` over the config file inside the container so that you can edit the config freely from your machine.
- `--network=host`: makes the container share the host's network, so that `localhost` in `prometheus.yml` refers to your machine rather than to the container.
- `--config.file`: a Prometheus flag (not a Docker option) giving the path to the `prometheus.yml` file inside the container.
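If you prefer to start the container from a script, the same command can be assembled as an argument list. The sketch below is one possible approach; the helper name and the example path are hypothetical. Passing the arguments as a list sidesteps the quoting problems that file paths with spaces cause in a shell.

```python
import shlex
from pathlib import Path
from typing import List


def docker_run_args(config_path: str) -> List[str]:
    """Build the `docker run` argument list used in this lesson."""
    host_config = Path(config_path).expanduser().resolve()
    return [
        "docker", "run", "--rm", "-d",
        "--network=host",
        "--name", "prometheus",
        # Passed as one list element, so spaces in the path need no quoting.
        "-v", f"{host_config}:/etc/prometheus/prometheus.yml",
        "prom/prometheus",
        "--config.file=/etc/prometheus/prometheus.yml",
        "--web.enable-lifecycle",
    ]


args = docker_run_args("/home/me/my configs/prometheus.yml")  # hypothetical path
print(shlex.join(args))  # shell-quoted form of the command
# subprocess.run(args, check=True) would start the container.
```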

At this point, Prometheus should be running; thus, __simply go to [`localhost:9090`](http://localhost:9090) in your browser,__ and you should see the following Prometheus dashboard.

We recommend selecting local time so that timestamps are displayed in your time zone. Additionally, select the query history so that you can track all queries made. Finally, configure dark mode (top-right corner) for the best appearance.

<img src="images/prom_init_dash.png?modified=232132453">

Once you begin typing in the expression field, Prometheus will suggest metric names. At this point, only the metrics that Prometheus exposes about itself are being scraped, so we can start by running the expressions beginning with `prometheus_`.

For instance, run ``prometheus_build_info``, and execute the query. Notice that you can view detailed information about the query when you hover over it from the dropdown list.

<img src="images/prom_expression_selected_dashboard.png?modified=232132453">

The results of the query are provided in the panel underneath. In this case, the result was 

`prometheus_build_info{branch="HEAD", goversion="go1.17.1", instance="localhost:9090", job="prometheus", revision="b30db03f35651888e34ac101a06e25d27d15b476", version="2.30.2"}`.

- **metric name:** `prometheus_build_info` 
- **labels:** goversion, instance, job, etc.
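The same expression can also be evaluated outside the dashboard through the server's HTTP API. The sketch below builds the URL for an instant query against `/api/v1/query`; the helper names are hypothetical, and `run_query` assumes a server is running on `localhost:9090`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(expression: str, base_url: str = "http://localhost:9090") -> str:
    """URL that evaluates a PromQL expression via the HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expression})}"


def run_query(expression: str) -> list:
    """Send the query and return the result series (needs a running server)."""
    with urlopen(instant_query_url(expression), timeout=5) as response:
        return json.load(response)["data"]["result"]


print(instant_query_url("prometheus_build_info"))
# http://localhost:9090/api/v1/query?query=prometheus_build_info
```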

A text list of all metrics currently exposed by the Prometheus server itself can be found at `http://localhost:9090/metrics`.
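The `/metrics` page is plain text in which `# TYPE` comment lines declare the type of each metric. As a small illustration, the sketch below extracts those declarations; the sample text is a shortened, hypothetical excerpt, and `metric_types` is a helper written for this lesson.

```python
from typing import Dict


def metric_types(exposition_text: str) -> Dict[str, str]:
    """Map each metric name to its declared type."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # Type metadata lines look like: "# TYPE <name> <type>"
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


# Shortened, hypothetical sample of what the /metrics endpoint returns.
sample = """\
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.42
# TYPE go_goroutines gauge
go_goroutines 31
"""
print(metric_types(sample))
# {'process_cpu_seconds_total': 'counter', 'go_goroutines': 'gauge'}
```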

## Prometheus Configuration

### Configuration file

> __The Prometheus server is configured via [YAML](https://en.wikipedia.org/wiki/YAML) files.__ The configuration file is used to define the metrics and instances we intend to scrape. 

Previously, we ran Prometheus __with a minimal configuration file__. To view the configuration currently loaded by the server, click [`here`](http://localhost:9090/config) or navigate to **Status > Configuration** from the dashboard.

To obtain a better understanding of the config file, consider the example below.

In [None]:
# Section with default values
global:
  scrape_interval: 15s # How frequently to scrape targets from jobs
  scrape_timeout: 10s # How long to wait for a response before a scrape times out.
  evaluation_interval: 15s # How frequently to evaluate rules (recording and alerting rules).
# Prometheus alert manager (left for now).
alerting:
  alertmanagers:
  - follow_redirects: true
    scheme: http
    timeout: 10s
    api_version: v2
    static_configs:
    - targets: []
# Specific configuration for jobs
scrape_configs:
- job_name: prometheus # Name of the job (can be anything).
  honor_timestamps: true # Use timestamps provided by the job.
  scrape_interval: 15s # Same as before, but for this job.
  scrape_timeout: 10s # ^
  metrics_path: /metrics # Where metrics are located w.r.t. the port (localhost:9090/metrics).
  scheme: http # Configures the protocol scheme used for requests (localhost is http).
  follow_redirects: true
  static_configs:
  - targets:
    - localhost:9090

> __Prometheus provides numerous configuration options. Click [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) to view all of them.__

*Things to note about the config file*
- __Prometheus can reload its configuration while it is running. Therefore, you can make live changes__: edit the config file, then trigger a reload by sending a `POST` request to the `/-/reload` endpoint (this requires the `--web.enable-lifecycle` flag) or by sending `SIGHUP` to the Prometheus process.
- __If the file is not correctly formatted, the new configuration will not be applied__ (a valid **YAML** file is required).
- All the parameters specified under `global` are available to all other scraper configs defined in the configuration file. In other words, if you create `scrape_config`, which retrieves metrics from EC2 instances, it will use the global parameters, unless explicitly changed.
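The inheritance rule in the last point can be sketched as a simple lookup with a fallback. The dictionaries below are hypothetical stand-ins for the parsed `global` section and one entry of `scrape_configs`; this illustrates the rule only and is not how Prometheus itself is implemented.

```python
from typing import Dict

INHERITED_KEYS = ("scrape_interval", "scrape_timeout")


def effective_settings(global_cfg: Dict[str, str], job_cfg: Dict[str, str]) -> Dict[str, str]:
    """Job-level values win; anything the job omits falls back to `global`."""
    return {key: job_cfg.get(key, global_cfg[key]) for key in INHERITED_KEYS}


global_cfg = {"scrape_interval": "15s", "scrape_timeout": "10s"}
node_job = {"job_name": "node", "scrape_interval": "30s"}  # overrides the interval only
print(effective_settings(global_cfg, node_job))
# {'scrape_interval': '30s', 'scrape_timeout': '10s'}
```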

By adding the `--web.enable-lifecycle` flag when creating the Docker container and mounting our local `prometheus.yml` file in the container, we should be able to make live changes to the config. To demonstrate, we change the default `scrape_interval` and `scrape_timeout` to 1 s. The start of your config should now be as follows:

In [None]:
# Section with default values
global:
  scrape_interval: 1s # How frequently to scrape targets from jobs
  scrape_timeout: 1s # How long to wait for a response before a scrape times out.
  evaluation_interval: 15s # How frequently to evaluate rules (recording and alerting rules).
# Prometheus alert manager (left for now).

Now, we can call the `/-/reload` endpoint to update the config while the Prometheus server is running. Once you have made the changes, run the following command in your terminal (use bash for this command):

In [None]:
curl -X POST http://localhost:9090/-/reload

Now, go to `localhost:9090/config`, and examine the config. It should be updated, and you should be able to view the changes made.
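If `curl` is not available, the same reload can be triggered from Python's standard library. This is a minimal sketch with hypothetical helper names; `reload_config` assumes the server runs on `localhost:9090` and was started with `--web.enable-lifecycle`.

```python
from urllib.request import Request, urlopen


def build_reload_request(base_url: str = "http://localhost:9090") -> Request:
    """POST request for the lifecycle reload endpoint."""
    return Request(f"{base_url}/-/reload", method="POST")


def reload_config(base_url: str = "http://localhost:9090") -> int:
    """Trigger a reload and return the HTTP status (needs a running server)."""
    with urlopen(build_reload_request(base_url), timeout=5) as response:
        return response.status


request = build_reload_request()
print(request.get_method(), request.full_url)
# POST http://localhost:9090/-/reload
```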

### [scrape_configs](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)

> This specifies the __sets of targets__ (i.e. what should be scraped) and parameters, describing how to perform the scraping action for each target. For instance, one can specify a `scrape_config` to scrape metrics from an EC2 instance or a config file to scrape metrics from Kubernetes.

Targets can be defined __statically__ or __dynamically.__

- `statically`: listed directly in our configuration YAML under `static_configs` in a `scrape_config`.
- `dynamically`: using __service-discovery configurations__. By defining the service-discovery options, we can allow Prometheus to track new instances of a particular service.<br>
    For example, in the industry, docker containers may be started, stopped, and created constantly based on the needs of the business.<br>
    If the service-discovery configuration is defined, Prometheus will be able to track any newly created docker containers without having to specify them directly in the configuration file.

Many service-discovery integrations are available for commonly used services:
- [`consul_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#consul_sd_config): retrieve scraping targets from HashiCorp's [Consul](https://www.consul.io/) used for service discovery and network setup.
- [`dockerswarm_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#dockerswarm_sd_config): used with [Docker's Swarm mode](https://docs.docker.com/engine/swarm/), which allows us to connect and orchestrate many containers as one application.
- [`ec2_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config): retrieve scrape targets from EC2 instances.
- [`kubernetes_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config): configure scrape targets from Kubernetes REST API.


The full documentation on creating configs can be found [here](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#static_config).

### Command line

> The Prometheus server instance itself can be configured using command-line flags.

Run the command below to see the available options:

In [None]:
docker container run --rm prom/prometheus --help

The most notable options are

- `--web.read-timeout=5m`: maximum duration before the reading of a request times out and idle connections are closed.
- `--web.max-connections=512`: maximum number of simultaneous connections.
- `--web.enable-lifecycle`: enable shutdown and reload via HTTP requests (i.e. requests to `/-/reload` mentioned previously).
- `--web.page-title="..."`: changes the title of the web page we opened previously.
- `--storage.tsdb.retention.time`: how long the data are retained in storage (default: `15` days).
- `--storage.remote.read-concurrent-limit=10`: maximum number of concurrent remote-read calls.
- `--log.level=info`: verbosity of the Prometheus server. One of the following can be set: `debug, info, warn or error`.

__Note: Flag names are hierarchical. The category comes first, followed by optional subcategories, with the option itself last__ (e.g. `--storage.tsdb.retention.time`).

This naming reflects the structure of Prometheus's components and the conventions of its implementation language, `golang`.

## Exporters

> Exporters are programs that export existing metrics from third-party systems and make them available for Prometheus to scrape. For example, there are exporters for tracking GitHub metrics.

There are hundreds of exporters, all in different states of development. A list of the most common ones can be found [here](https://prometheus.io/docs/instrumenting/exporters/), and a more complete list, including the ports they use, can be found on GitHub [here](https://github.com/prometheus/prometheus/wiki/Default-port-allocations). Exporters fall into the following categories:
- **official**: best practices, and verified by Prometheus (__always pick them whenever possible__).
- **unofficial**: working, but not verified for best practices or may have overlapping functionalities.
- **in development**: to be released as either of the two above.

*Things to note*
- __Most exporters occupy ports `9100` to `9999`__, and any new exporter should use a free port in that range if available (explore the GitHub list above to determine which ports exporters run on).
- There are a few exporters outside the standard range (see the GitHub list).

To improve our understanding, we will employ one of the commonly used exporters, which will track the hardware and software metrics on your OS: __[Node exporter](https://github.com/prometheus/node_exporter)__.


### Setting up Node exporter (Linux/macOS)

> `Node` exporter is a single static binary that can be downloaded and run straight from the workstation.

- The following commands will download the exporter and unpack the `.tar.gz` archive (__we assume you have a Linux-based system__).
- You may run the commands from any directory.
- If on macOS, run `brew install node_exporter` instead.

In [None]:
# Change to the directory into which you want to download the node exporter.
# Use wget to download the Linux node exporter tarball.
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
# Unpack the tarball into the current directory.
tar xvfz node_exporter-1.1.2.linux-amd64.tar.gz
# Remove the tarball after extraction.
rm node_exporter-1.1.2.linux-amd64.tar.gz
# Run the node exporter (Linux).
./node_exporter-1.1.2.linux-amd64/node_exporter
# Run the node exporter (macOS, installed via Homebrew) instead of the Linux binary above.
node_exporter --web.listen-address 127.0.0.1:9100

__Note:__ Add the following alias to your `.bashrc`, which will simplify the unpacking process of the tarball archives on Linux:
- `unpack` -> `tar xvzf`

After running the `node_exporter`, the output should be similar to that shown below:

<img src="images/node_exporter_output.png?modified=23213234453">

### Configuring Prometheus to scrape node exporter metrics

Now that the exporter is running, we need to set up the `Prometheus` server to scrape data from it.

We change the `prometheus.yml` config file to contain the setup below, which will allow Prometheus to scrape the system metrics that the `node exporter` exposes.

#### For Linux/macOS systems

In [None]:
global:
  scrape_interval: '1s'  # Default for all jobs: scrape targets every second unless a job overrides it.
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:
  # Prometheus monitoring itself.
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']
  # OS monitoring
  - job_name: 'node'
    scrape_interval: '5s'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          group: 'production' # Custom label attached to every series scraped from this target.

As can be observed, the basic layout for defining new targets to scrape is

In [None]:
scrape_configs: # All targets will be defined in a config file.
  - job_name: 'prometheus' # Name of the job.
    scrape_interval: '10s' # Parameters defining how to scrape the target.
    static_configs:              # Allow you to specify a list of targets.    
      - targets: ['localhost:9090']  # Define the targets here.

Now, we need to pass this config file to the `Prometheus` server (__recall that it is contained in Docker__). There are two ways to do that:
- Mount the config file from your local machine into the Docker container at runtime:
    
    - Do this by passing the `-v` option when starting the container.
    - Do this if the configuration changes often and you want to reload it while the server is running.<br><br>

    
- Create a new Docker image, and copy the configuration:
    - Do this when the configuration is static and rarely changes.
    
We have already implemented the first approach, which allows us to edit the config file while the server is running, so we will use it here. We will edit the configuration file locally and subsequently tell the running server to reload it. For more information on the second approach using a `Dockerfile`, consult this page in the [documentation](https://prometheus.io/docs/prometheus/latest/installation/).

At this point, we can make the Prometheus server load the locally updated `prometheus.yml` file using the same command used previously.

In [None]:
curl -X POST http://localhost:9090/-/reload

__Go to [`localhost:9090/targets`](http://localhost:9090/targets) to verify that everything was set up correctly.__ On the targets dashboard, you should be able to observe that both of our targets are in the **UP** state; thus, metrics will be collected. Targets can also be in other predefined states:

- **Down**: Prometheus cannot connect to the target.
- **Unknown**: Prometheus cannot find the target, usually due to a configuration issue.
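The same information is available as JSON from the server's `/api/v1/targets` endpoint, which is convenient for scripted checks. The sketch below summarises a response that has already been parsed into a dictionary; the sample payload is a trimmed, hypothetical example, and `target_health` is a helper written for this lesson.

```python
from typing import Dict


def target_health(payload: dict) -> Dict[str, str]:
    """Map each active target's scrape URL to its health state."""
    return {
        target["scrapeUrl"]: target["health"]
        for target in payload["data"]["activeTargets"]
    }


# Trimmed, hypothetical response from GET /api/v1/targets.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://localhost:9100/metrics", "health": "down"},
        ]
    },
}
print(target_health(sample))
```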

<img src="images/prom_target_dash.png?modified=232133242234453">

From the targets window, click on the endpoint of the node exporter: `http://localhost:9100/metrics`, to obtain a text list of all the metrics that are being scraped and that are available to Prometheus. 

This can be helpful when unsure of the available metrics. 

<img src="images/prom_metrics_endpoint.png?modified=23213344232453">

Once done, go to [`localhost:9090`](http://localhost:9090), and check if the metrics prefixed with `node` are available:

<img src="images/prom_win_commands.png?modified=232123422342453">

Run a few expressions to see the resulting graphs, and check `node_hwmon_temp_celsius`, which monitors the temperature of your system hardware.

<img src="images/prom_win_process.png?modified=232122323442453">

__In the next lesson, we learn how to query data efficiently__. Prior to that, however, we discuss how to monitor metrics from Docker containers.

### Configuring Prometheus to track docker

To track Docker with Prometheus, we need to edit the Docker configuration so that we can specify its metric address. This will tell Prometheus where to collect the metrics. We do this by editing the `daemon.json` docker file or the Docker Desktop configuration file.

- **For Linux users,** open `/etc/docker/daemon.json`; if the file does not already exist, create it using `sudo touch /etc/docker/daemon.json`.
- **For macOS/Windows users,** go to Docker Desktop, and click the cog to go to **Settings > Docker Engine**.

Add this code to either the `daemon.json` file or the Docker Engine config to configure the scraping of metrics. Add the code to the section of the `.json` file where the `experimental` key value is located.


In [None]:
{
    "metrics-addr": "127.0.0.1:9323",
    "experimental": true,
    "features": {
        "buildkit": true
    }
}

For Docker Desktop, you will probably be required to restart your Docker Desktop application. 

On Linux, you will need to restart Docker using `sudo service docker restart`. You will also need to restart all your containers.
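If your `daemon.json` already contains other settings, the new keys should be merged into it rather than replacing the file. The following sketch shows one way to do that on an already-parsed config; the helper name and the example settings are hypothetical, and reading or writing the actual file is left out.

```python
import json


def with_metrics_addr(daemon_config: dict, addr: str = "127.0.0.1:9323") -> dict:
    """Return a copy of the daemon config with the metrics endpoint enabled."""
    updated = dict(daemon_config)  # shallow copy; existing keys are preserved
    updated["metrics-addr"] = addr
    updated["experimental"] = True
    return updated


existing = {"features": {"buildkit": True}}  # hypothetical current settings
merged = with_metrics_addr(existing)
print(json.dumps(merged, indent=4))
```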

The next step involves updating the `prometheus.yml` file to add Docker as a target so that Prometheus can begin scraping the metrics. Add Docker as a target in your `prometheus.yml` file, as follows:

In [None]:
global:
  scrape_interval: '15s'  # By default, scrape targets every 15 seconds.
  scrape_timeout: '10s'
  external_labels:
    monitor: 'codelab-monitor'

scrape_configs:

  # Prometheus monitoring itself.
  - job_name: 'prometheus'
    scrape_interval: '10s'
    static_configs:
      - targets: ['localhost:9090']

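  # OS monitoring (this example targets the Windows exporter; on Linux/macOS, keep the 'node' job on localhost:9100 defined earlier).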
  - job_name: 'wmiexporter'
    scrape_interval: '30s'
    static_configs:
      - targets: ['localhost:9182']

  # Docker monitoring
  - job_name: 'docker'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['127.0.0.1:9323'] # Metrics address from our daemon.json file.
  

Now, update the config on the Prometheus server while it is running, using the command:

In [None]:
curl -X POST http://localhost:9090/-/reload

Observe the Prometheus targets pane from the server, and see that docker is now available as an endpoint to collect metrics. Remember to check its endpoint/metrics to view a list of available metrics to track.

Conventionally, Docker metrics start with `engine_daemon_`; those describing containers start with `engine_daemon_container_`.

<img src="images/prom_targets_docker.png?modified=23212332453">

Now, we start some containers. You can start more Prometheus containers (as in the previous case) or use images from Docker Hub to create some containers.

Once they are running, you can run the expressions to view the output of the metrics. For example, here is the result after starting and stopping some containers from the `engine_daemon_container_states_containers` expression. 

<img src="images/prom_cont_states.png?modified=232132453">

Above, the red line indicates the containers stopping, while the blue line indicates the containers running.

## Conclusion
At this point, you should have a good understanding of how to
- start a Prometheus server inside a Docker container.
- access the Prometheus dashboard, and run some query expressions.
- create a `prometheus.yml` configuration file for a server which can be updated while the server is running.
- integrate the node exporter with the server to track hardware/software metrics from a machine.
- set up Prometheus to track the docker service. 
- define the targets to track in the `prometheus.yml` config file. 

## Further Reading 

- Explore the alternatives to Prometheus (read about them [here](https://prometheus.io/docs/introduction/comparison/#)), and determine when they are suitable to use. 
- Read about Prometheus's [Push Gateway](https://prometheus.io/docs/practices/pushing/) and when it should be used.
- Read briefly about alerts in Prometheus (e.g. when a failure occurs, you will be alerted via e-mail). Go through the [documentation](https://prometheus.io/docs/alerting/latest/overview/), and obtain a basic grasp (note, however, that alerting rules will be covered in the next lesson).
- Read about how to integrate the local disk storage (which is limited) with Prometheus's remote storage [here](https://prometheus.io/docs/prometheus/latest/storage/).
- Read about Prometheus Federation [here](https://prometheus.io/docs/prometheus/latest/federation/).