
Prometheus raise out of bounds error for all targets after resume the linux system from a suspend #8243

Closed
jeffrey4l opened this issue Dec 1, 2020 · 17 comments · Fixed by #8601

Comments

@jeffrey4l
Contributor

What did you do?

After suspending the system and resuming it, Prometheus reports the following error and cannot scrape any new metrics unless the Prometheus service is restarted.

What did you expect to see?

Prometheus should continue to scrape new metrics.

What did you see instead? Under which circumstances?

See the logs in the Environment section below. In the web UI, I got the following:

[screenshot: Prometheus web UI showing the failing targets]

Environment

  • System information:
Linux t470p 5.4.79-1-lts #1 SMP Sun, 22 Nov 2020 14:22:21 +0000 x86_64 GNU/Linux
  • Prometheus version:
prometheus, version 2.22.2 (branch: tarball, revision: 2.22.2)
  build user:       someone@builder
  build date:       20201117-18:44:08
  go version:       go1.15.5
  platform:         linux/amd64
# prometheus.service

# /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus service
Requires=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Restart=on-failure
WorkingDirectory=/usr/share/prometheus
EnvironmentFile=-/etc/conf.d/prometheus
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/dat>
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
NoNewPrivileges=true
ProtectHome=true
ProtectSystem=full
ProtectHostname=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=true
LockPersonality=true
RestrictRealtime=yes
RestrictNamespaces=yes
MemoryDenyWriteExecute=yes
PrivateDevices=yes
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/prometheus.service.d/prometheus.conf
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus $PROME>
ProtectHome=False
  • Prometheus configuration file:
---
global:
  scrape_interval: 300s
  evaluation_interval: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - localhost:9100
        labels:
          uid: t470
  • Logs:
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.714Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1668 last=1669 duration=32.511167ms
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.763Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=739.34µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=802.06µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=checkpoint.go:96 component=tsdb msg="Creating checkpoint" from_segment=1670 to_segment=1671 mint=1606780800000
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.822Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1670 last=1671 duration=10.152588ms
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1378 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=172.20.149.141%3A22" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=6
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1145 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxxx%3A22" msg="Append failed" err="out of bounds"
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1094 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxx%3A22" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1378 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fxxx" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=17
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1145 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Append failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1094 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1378 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=67
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1145 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Append failed" err="out of bounds"
@jeffrey4l jeffrey4l changed the title Prometheus raise out of bounds error for all target after resume the system from a suspend Prometheus raise out of bounds error for all targets after resume the linux system from a suspend Dec 1, 2020
@roidelapluie
Member

Thanks for your message. Suspending the system while Prometheus is running is not supported at the moment, and TBH it is strange to suspend a monitoring system.

@codesome @bwplotka do you feel like we should consider this a bug, or just something we can't/won't support?

@brian-brazil
Contributor

This sounds like time getting mixed up inside of Go. Generally if the system clock changes, things are going to break and there's not much we can do about it.

@roidelapluie
Member

> This sounds like time getting mixed up inside of Go. Generally if the system clock changes, things are going to break and there's not much we can do about it.

It is not. I think we only update the valid end times in the TSDB when we reload the head; they are relative to the head, not to time.Now().

@brian-brazil
Contributor

That's not right, the head only has a minimum timestamp for bounds checking - no maximum. The maximum is a safety check in the scrape code.
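
To make the distinction concrete, here is a rough Go sketch of the two checks being described. This is not the actual Prometheus code; all names, types and the tolerance value are made up for illustration.

```go
// Illustration only: the head enforces a lower (minimum) timestamp bound with
// no upper bound, while the scrape side applies its own "not too far into the
// future" safety check. All names and the tolerance below are made up.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOutOfBounds = errors.New("out of bounds")

// head stands in for the TSDB head block.
type head struct {
	minValidTime int64 // milliseconds; samples older than this are rejected
}

// append rejects samples below the head's minimum valid time.
// There is no upper bound here.
func (h *head) append(ts int64) error {
	if ts < h.minValidTime {
		return errOutOfBounds
	}
	return nil
}

// scrapeCheck is the scrape-side safety net: reject samples whose timestamp
// is further in the future than some tolerance (hypothetical value).
func scrapeCheck(ts int64, tolerance time.Duration) error {
	if ts > time.Now().Add(tolerance).UnixMilli() {
		return errors.New("too far into the future")
	}
	return nil
}

func main() {
	h := &head{minValidTime: time.Now().Add(-time.Hour).UnixMilli()}

	fmt.Println(h.append(time.Now().UnixMilli()))                    // <nil>
	fmt.Println(h.append(time.Now().Add(-2 * time.Hour).UnixMilli())) // out of bounds
	fmt.Println(scrapeCheck(time.Now().Add(time.Hour).UnixMilli(), 10*time.Minute)) // too far into the future
}
```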

@brian-brazil
Contributor

Can you share the history of prometheus_tsdb_head_max_time_seconds, prometheus_tsdb_head_min_time_seconds, and give a timeline of the suspend? Can you confirm that this doesn't just affect one scrape post suspend, but ongoing?
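
For anyone who wants to collect that data, here is a minimal sketch (assuming Prometheus is listening on localhost:9090; adjust the address, range, and step as needed) that pulls the recent history of those two metrics via the HTTP query_range API:

```go
// Minimal sketch: fetch the recent history of the two head-time metrics via
// the Prometheus HTTP API and print the raw JSON responses.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-6 * time.Hour) // wide enough to cover the suspend window

	for _, q := range []string{
		"prometheus_tsdb_head_max_time_seconds",
		"prometheus_tsdb_head_min_time_seconds",
	} {
		u := fmt.Sprintf(
			"http://localhost:9090/api/v1/query_range?query=%s&start=%d&end=%d&step=60",
			url.QueryEscape(q), start.Unix(), end.Unix(),
		)
		resp, err := http.Get(u)
		if err != nil {
			panic(err)
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s:\n%s\n\n", q, body)
	}
}
```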

@fgm

fgm commented Feb 4, 2021

FWIW, same issue on macOS: I'm experimenting with Prometheus in a container on my Mac, so it goes to sleep at night, and when it wakes up, I get this error too. What's even more problematic is that the issue persists even after stopping and restarting the container: some information seems to be corrupted in storage and survives across restarts.

@franz-josef-kaiser

I am having the exact same issue:

  • macOS 1.11.
  • Docker Desktop 3.1
  • Docker-Compose

The host's localtime is shared with all containers as a read-only (ro) volume: /private/etc/localtime:/etc/localtime:ro.

Below are the docker logs from the container:

  monitor     | level=info ts=2021-02-04T17:32:39.671Z caller=main.go:326 msg="No time or size retention was set so using the default time retention" duration=15d
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.1, branch=HEAD, revision=e4487274853c587717006eeda8804e597d120340)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:369 build_context="(go=go1.15.6, user=root@0b5231a0de0f, date=20210120-00:09:36)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:370 host_details="(Linux 4.19.121-linuxkit #1 SMP Tue Dec 1 17:50:32 UTC 2020 x86_64 monitor (none))"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
  monitor     | level=info ts=2021-02-04T17:32:39.695Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
  monitor     | level=info ts=2021-02-04T17:32:39.720Z caller=main.go:738 msg="Starting TSDB ..."
  monitor     | level=info ts=2021-02-04T17:32:39.729Z caller=tls_config.go:227 component=web msg="TLS is disabled." http2=false
  monitor     | level=info ts=2021-02-04T17:32:39.778Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
  monitor     | level=info ts=2021-02-04T17:32:39.779Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1.1577ms
  monitor     | level=info ts=2021-02-04T17:32:39.779Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
  monitor     | level=info ts=2021-02-04T17:32:39.784Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
  monitor     | level=info ts=2021-02-04T17:32:39.784Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=163.1µs wal_replay_duration=4.9535ms total_replay_duration=6.377ms
  monitor     | level=info ts=2021-02-04T17:32:39.792Z caller=main.go:758 fs_type=EXT4_SUPER_MAGIC
  monitor     | level=info ts=2021-02-04T17:32:39.793Z caller=main.go:761 msg="TSDB started"
  monitor     | level=info ts=2021-02-04T17:32:39.793Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/root.yml
  monitor     | level=info ts=2021-02-04T17:32:39.837Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/root.yml totalDuration=44.0719ms remote_storage=29.8µs web_handler=8.5µs query_engine=9.4µs scrape=2.7028ms scrape_sd=13.3593ms notify=11.4µs notify_sd=16µs rules=16.9668ms
  monitor     | level=info ts=2021-02-04T17:32:39.837Z caller=main.go:710 msg="Server is ready to receive web requests."
  monitor     | level=info ts=2021-02-04T21:00:58.179Z caller=compact.go:507 component=tsdb msg="write block" mint=1612459968023 maxt=1612461600000 ulid=01EXQDJ6T6K3D9YMVM709VS4ZZ duration=125.4913ms
  monitor     | level=info ts=2021-02-04T21:00:58.255Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=75.2094ms
  monitor     | level=info ts=2021-02-04T21:00:58.558Z caller=compact.go:507 component=tsdb msg="write block" mint=1612461603023 maxt=1612468800000 ulid=01EXQDJ70MZ8XHJF6EB8YRJTN8 duration=297.3878ms
  monitor     | level=info ts=2021-02-04T21:00:58.560Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=1.3618ms
  monitor     | level=warn ts=2021-02-04T21:01:05.233Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy-exporter target=http://exporter:9113/prometheus msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:09.803Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://example.test:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:09.899Z caller=scrape.go:1094 component="scrape manager" scrape_pool=monitor target=https://localhost:9090/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:10.825Z caller=scrape.go:1094 component="scrape manager" scrape_pool=controller target=http://traefik.example.test:8082/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:12.617Z caller=scrape.go:1094 component="scrape manager" scrape_pool=backend target=https://example.test:443/status%3Fjson msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:12.674Z caller=scrape.go:1094 component="scrape manager" scrape_pool=dashboard target=https://dashboard.example.test:443/api/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:13.071Z caller=scrape.go:1094 component="scrape manager" scrape_pool=alerts target=https://alerts.example.test:443/-/healthy msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:17.920Z caller=scrape.go:1094 component="scrape manager" scrape_pool=admin target=https://hi.example.test:443/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:20.229Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy-exporter target=http://exporter:9113/prometheus msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:25.256Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://bell.example:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:28.136Z caller=scrape.go:1094 component="scrape manager" scrape_pool=dashboard target=https://dashboard.example.test:443/api/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:40.257Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://example.test:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:40.352Z caller=scrape.go:1094 component="scrape manager" scrape_pool=monitor target=https://localhost:9090/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:41.277Z caller=scrape.go:1094 component="scrape manager" scrape_pool=controller target=http://traefik.example.test:8082/metrics msg="Appending scrape report failed" err="out of bounds"

@evtimmy

evtimmy commented Feb 27, 2021

Same issue when running in minikube on macOS. It would be nice to fix this, as it would make it easier to experiment with the Prometheus stack while developing in local clusters.

@slovdahl

slovdahl commented Mar 1, 2021

Same issue experienced on Ubuntu 20.04 running Prometheus in Docker for a development environment that is suspended outside office hours. A restart of the container seems to have fixed it this time.

$ docker -v
Docker version 20.10.4, build d3cb89e
version: '2.2'

services:
  prometheus:
    image: prom/prometheus:v2.22.0
    container_name: prometheus
    volumes:
      - ./prometheus/:/etc/prometheus/
      - ./data/prometheus:/prometheus
    ports:
      - 9090:9090
    entrypoint:
     - "/bin/prometheus"
     - "--config.file=/etc/prometheus/prometheus.yml"
     - "--storage.tsdb.path=/prometheus"
     - '--storage.tsdb.retention.time=52w'
     - "--web.console.libraries=/etc/prometheus/console_libraries"
     - "--web.console.templates=/etc/prometheus/consoles"
     - "--web.enable-lifecycle"
     - "--web.enable-admin-api"
    restart: always

@timofey-barmin

I experience the same issue (in various environments). A restart of Prometheus seems to help, but it's not clear how big the time jump must be to trigger the problem.
@brian-brazil @roidelapluie Any suggestions on that? I'm asking because I plan to restart Prometheus automatically when the time changes, so I need to know what time difference my script should react to.
Maybe there is another way to detect the problem from a script, as in the sketch below?
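
One idea I'm considering is to watch Prometheus's own /metrics endpoint and restart when the out-of-bounds counter keeps climbing. A rough sketch, assuming the prometheus_target_scrapes_sample_out_of_bounds_total counter is a reliable signal for this state; the address and polling interval are placeholders:

```go
// Rough sketch of one possible detector: poll Prometheus's own /metrics and
// watch prometheus_target_scrapes_sample_out_of_bounds_total. If the counter
// keeps increasing between polls, trigger a restart via whatever supervises
// the process (systemd, Docker, etc.).
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// outOfBoundsTotal scrapes the given /metrics endpoint and sums the values of
// the out-of-bounds counter lines.
func outOfBoundsTotal(metricsURL string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "prometheus_target_scrapes_sample_out_of_bounds_total") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	const metricsURL = "http://localhost:9090/metrics"

	prev, _ := outOfBoundsTotal(metricsURL)
	for {
		time.Sleep(5 * time.Minute)
		cur, err := outOfBoundsTotal(metricsURL)
		if err != nil {
			continue // Prometheus may be restarting; try again next cycle
		}
		if cur > prev {
			fmt.Println("out-of-bounds counter is increasing; a restart is probably needed")
		}
		prev = cur
	}
}
```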

@dgl
Member

dgl commented Mar 15, 2021

I'm hopeful the linked PR #8601 will fix this. (Note if you want to test it, you likely need to suspend for >10 minutes).

@timofey-barmin

@dgl Thanks!
I can confirm it helps me on macOS (version 10.15.7). I haven't tested other platforms yet.

@roidelapluie
Member

This will be released in the Prometheus 2.26 series.

A workaround if you run an earlier version is to use the --no-scrape.adjust-timestamps flag.

@roidelapluie
Member

I have changed my mind and released v2.25.2 with this fix. Thanks all!

@tajchert

I'm on 2.26.0 and got that issue on Windows 10 after updating time.

level=info ts=2021-04-30T02:39:14.596Z caller=main.go:767 msg="Server is ready to receive web requests."
level=warn ts=2021-04-30T02:39:25.208Z caller=scrape.go:1402 component="scrape manager" scrape_pool=prometheus target=http://localhost:9182/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=7815
level=warn ts=2021-04-30T02:39:25.208Z caller=scrape.go:1104 component="scrape manager" scrape_pool=prometheus target=http://localhost:9182/metrics msg="Appending scrape report failed" err="out of bounds"

@roidelapluie
Member

If you shift time backwards, this is expected. I don't think there is a reasonable way we can deal with that.

@tajchert

Yes, it was a backward change (a BIOS update kicked the time 4 hours ahead for some reason). Once the clock passed the incorrectly set time, data started coming in normally.

@prometheus prometheus locked as resolved and limited conversation to collaborators Nov 13, 2021