
Prometheus raise out of bounds error for all targets after resume the linux system from a suspend #8243

Closed
jeffrey4l opened this issue Dec 1, 2020 · 17 comments · Fixed by #8601

Comments

@jeffrey4l
Contributor

What did you do?

After suspending the system and resuming it, Prometheus reports the following error and cannot scrape any new metrics unless the Prometheus service is restarted.

What did you expect to see?

Prometheus should continue to scrape new metrics.

What did you see instead? Under which circumstances?

See the logs in the Environment section below. In the web UI, I got the following:

[screenshot: Prometheus web UI showing the failing targets]

Environment

  • System information:
Linux t470p 5.4.79-1-lts #1 SMP Sun, 22 Nov 2020 14:22:21 +0000 x86_64 GNU/Linux
  • Prometheus version:
prometheus, version 2.22.2 (branch: tarball, revision: 2.22.2)
  build user:       someone@builder
  build date:       20201117-18:44:08
  go version:       go1.15.5
  platform:         linux/amd64
# prometheus.service

# /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus service
Requires=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Restart=on-failure
WorkingDirectory=/usr/share/prometheus
EnvironmentFile=-/etc/conf.d/prometheus
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/dat>
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
NoNewPrivileges=true
ProtectHome=true
ProtectSystem=full
ProtectHostname=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=true
LockPersonality=true
RestrictRealtime=yes
RestrictNamespaces=yes
MemoryDenyWriteExecute=yes
PrivateDevices=yes
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/prometheus.service.d/prometheus.conf
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus $PROME>
ProtectHome=False
  • Prometheus configuration file:
---
global:
  scrape_interval: 300s
  evaluation_interval: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - localhost:9100
        labels:
          uid: t470
  • Logs:
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.714Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1668 last=1669 duration=32.511167ms
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.763Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=739.34µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=802.06µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=checkpoint.go:96 component=tsdb msg="Creating checkpoint" from_segment=1670 to_segment=1671 mint=1606780800000
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.822Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1670 last=1671 duration=10.152588ms
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1378 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=172.20.149.141%3A22" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=6
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1145 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxxx%3A22" msg="Append failed" err="out of bounds"
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1094 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxx%3A22" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1378 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fxxx" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=17
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1145 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Append failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1094 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1378 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=67
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1145 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Append failed" err="out of bounds"
@jeffrey4l jeffrey4l changed the title Prometheus raise out of bounds error for all target after resume the system from a suspend Prometheus raise out of bounds error for all targets after resume the linux system from a suspend Dec 1, 2020
@roidelapluie
Member

Thanks for your message. Suspending the system while Prometheus is running is not supported at the moment, and TBH it is strange to suspend a monitoring system.

@codesome @bwplotka do you feel like we should consider this a bug, or just something we can't/won't support?

@brian-brazil
Contributor

This sounds like time getting mixed up inside of Go. Generally if the system clock changes, things are going to break and there's not much we can do about it.

@roidelapluie
Member

> This sounds like time getting mixed up inside of Go. Generally if the system clock changes, things are going to break and there's not much we can do about it.

It is not. I think we only update the valid end times in the TSDB when we reload the head; they are relative to the head, not to time.Now().

@brian-brazil
Contributor

That's not right, the head only has a minimum timestamp for bounds checking - no maximum. The maximum is a safety check in the scrape code.
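
To make the distinction concrete, here is a rough Go sketch of the two checks being described. This is not the actual Prometheus code; all names, types and the tolerance value are made up for illustration.

```go
// Illustration only: the head enforces a lower (minimum) timestamp bound with
// no upper bound, while the scrape side applies its own "not too far into the
// future" safety check. All names and the tolerance below are made up.
package main

import (
	"errors"
	"fmt"
	"time"
)

var errOutOfBounds = errors.New("out of bounds")

// head stands in for the TSDB head block.
type head struct {
	minValidTime int64 // milliseconds; samples older than this are rejected
}

// append rejects samples below the head's minimum valid time.
// There is no upper bound here.
func (h *head) append(ts int64) error {
	if ts < h.minValidTime {
		return errOutOfBounds
	}
	return nil
}

// scrapeCheck is the scrape-side safety net: reject samples whose timestamp
// is further in the future than some tolerance (hypothetical value).
func scrapeCheck(ts int64, tolerance time.Duration) error {
	if ts > time.Now().Add(tolerance).UnixMilli() {
		return errors.New("too far into the future")
	}
	return nil
}

func main() {
	h := &head{minValidTime: time.Now().Add(-time.Hour).UnixMilli()}

	fmt.Println(h.append(time.Now().UnixMilli()))                    // <nil>
	fmt.Println(h.append(time.Now().Add(-2 * time.Hour).UnixMilli())) // out of bounds
	fmt.Println(scrapeCheck(time.Now().Add(time.Hour).UnixMilli(), 10*time.Minute)) // too far into the future
}
```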

@brian-brazil
Contributor

Can you share the history of prometheus_tsdb_head_max_time_seconds, prometheus_tsdb_head_min_time_seconds, and give a timeline of the suspend? Can you confirm that this doesn't just affect one scrape post suspend, but ongoing?
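
For anyone who wants to collect that data, here is a minimal sketch (assuming Prometheus is listening on localhost:9090; adjust the address, range, and step as needed) that pulls the recent history of those two metrics via the HTTP query_range API:

```go
// Minimal sketch: fetch the recent history of the two head-time metrics via
// the Prometheus HTTP API and print the raw JSON responses.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-6 * time.Hour) // wide enough to cover the suspend window

	for _, q := range []string{
		"prometheus_tsdb_head_max_time_seconds",
		"prometheus_tsdb_head_min_time_seconds",
	} {
		u := fmt.Sprintf(
			"http://localhost:9090/api/v1/query_range?query=%s&start=%d&end=%d&step=60",
			url.QueryEscape(q), start.Unix(), end.Unix(),
		)
		resp, err := http.Get(u)
		if err != nil {
			panic(err)
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s:\n%s\n\n", q, body)
	}
}
```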

@fgm

fgm commented Feb 4, 2021

FWIW, same issue on macOS: I'm experimenting with Prometheus in a container on my Mac, so it goes to sleep at night, and when it wakes up, I get this error too. What's even more problematic is that the issue persists even after stopping and restarting the container: some information seems to be corrupted in storage and survives across restarts.

@franz-josef-kaiser

I am having the exact same issue:

  • macOS 1.11.
  • Docker Desktop 3.1
  • Docker-Compose

The host's localtime is shared with all containers as a read-only (ro) volume: /private/etc/localtime:/etc/localtime:ro.

Below are the docker logs from the container:

  monitor     | level=info ts=2021-02-04T17:32:39.671Z caller=main.go:326 msg="No time or size retention was set so using the default time retention" duration=15d
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:364 msg="Starting Prometheus" version="(version=2.24.1, branch=HEAD, revision=e4487274853c587717006eeda8804e597d120340)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:369 build_context="(go=go1.15.6, user=root@0b5231a0de0f, date=20210120-00:09:36)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:370 host_details="(Linux 4.19.121-linuxkit #1 SMP Tue Dec 1 17:50:32 UTC 2020 x86_64 monitor (none))"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:371 fd_limits="(soft=1048576, hard=1048576)"
  monitor     | level=info ts=2021-02-04T17:32:39.672Z caller=main.go:372 vm_limits="(soft=unlimited, hard=unlimited)"
  monitor     | level=info ts=2021-02-04T17:32:39.695Z caller=web.go:530 component=web msg="Start listening for connections" address=0.0.0.0:9090
  monitor     | level=info ts=2021-02-04T17:32:39.720Z caller=main.go:738 msg="Starting TSDB ..."
  monitor     | level=info ts=2021-02-04T17:32:39.729Z caller=tls_config.go:227 component=web msg="TLS is disabled." http2=false
  monitor     | level=info ts=2021-02-04T17:32:39.778Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
  monitor     | level=info ts=2021-02-04T17:32:39.779Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1.1577ms
  monitor     | level=info ts=2021-02-04T17:32:39.779Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
  monitor     | level=info ts=2021-02-04T17:32:39.784Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
  monitor     | level=info ts=2021-02-04T17:32:39.784Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=163.1µs wal_replay_duration=4.9535ms total_replay_duration=6.377ms
  monitor     | level=info ts=2021-02-04T17:32:39.792Z caller=main.go:758 fs_type=EXT4_SUPER_MAGIC
  monitor     | level=info ts=2021-02-04T17:32:39.793Z caller=main.go:761 msg="TSDB started"
  monitor     | level=info ts=2021-02-04T17:32:39.793Z caller=main.go:887 msg="Loading configuration file" filename=/etc/prometheus/root.yml
  monitor     | level=info ts=2021-02-04T17:32:39.837Z caller=main.go:918 msg="Completed loading of configuration file" filename=/etc/prometheus/root.yml totalDuration=44.0719ms remote_storage=29.8µs web_handler=8.5µs query_engine=9.4µs scrape=2.7028ms scrape_sd=13.3593ms notify=11.4µs notify_sd=16µs rules=16.9668ms
  monitor     | level=info ts=2021-02-04T17:32:39.837Z caller=main.go:710 msg="Server is ready to receive web requests."
  monitor     | level=info ts=2021-02-04T21:00:58.179Z caller=compact.go:507 component=tsdb msg="write block" mint=1612459968023 maxt=1612461600000 ulid=01EXQDJ6T6K3D9YMVM709VS4ZZ duration=125.4913ms
  monitor     | level=info ts=2021-02-04T21:00:58.255Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=75.2094ms
  monitor     | level=info ts=2021-02-04T21:00:58.558Z caller=compact.go:507 component=tsdb msg="write block" mint=1612461603023 maxt=1612468800000 ulid=01EXQDJ70MZ8XHJF6EB8YRJTN8 duration=297.3878ms
  monitor     | level=info ts=2021-02-04T21:00:58.560Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=1.3618ms
  monitor     | level=warn ts=2021-02-04T21:01:05.233Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy-exporter target=http://exporter:9113/prometheus msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:09.803Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://example.test:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:09.899Z caller=scrape.go:1094 component="scrape manager" scrape_pool=monitor target=https://localhost:9090/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:10.825Z caller=scrape.go:1094 component="scrape manager" scrape_pool=controller target=http://traefik.example.test:8082/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:12.617Z caller=scrape.go:1094 component="scrape manager" scrape_pool=backend target=https://example.test:443/status%3Fjson msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:12.674Z caller=scrape.go:1094 component="scrape manager" scrape_pool=dashboard target=https://dashboard.example.test:443/api/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:13.071Z caller=scrape.go:1094 component="scrape manager" scrape_pool=alerts target=https://alerts.example.test:443/-/healthy msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:17.920Z caller=scrape.go:1094 component="scrape manager" scrape_pool=admin target=https://hi.example.test:443/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:20.229Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy-exporter target=http://exporter:9113/prometheus msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:25.256Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://bell.example:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:28.136Z caller=scrape.go:1094 component="scrape manager" scrape_pool=dashboard target=https://dashboard.example.test:443/api/health msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:40.257Z caller=scrape.go:1094 component="scrape manager" scrape_pool=proxy target=http://example.test:80/stub_status msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:40.352Z caller=scrape.go:1094 component="scrape manager" scrape_pool=monitor target=https://localhost:9090/metrics msg="Appending scrape report failed" err="out of bounds"
  monitor     | level=warn ts=2021-02-04T21:01:41.277Z caller=scrape.go:1094 component="scrape manager" scrape_pool=controller target=http://traefik.example.test:8082/metrics msg="Appending scrape report failed" err="out of bounds"

@evtimmy

evtimmy commented Feb 27, 2021

Same issue when running in minikube on macOS. It would be nice to fix this, as it would make it easier to experiment with the Prometheus stack while developing in local clusters.

@slovdahl

slovdahl commented Mar 1, 2021

Same issue experienced on Ubuntu 20.04 running Prometheus in Docker for a development environment that is suspended outside office hours. A restart of the container seems to have fixed it this time.

$ docker -v
Docker version 20.10.4, build d3cb89e
version: '2.2'

services:
  prometheus:
    image: prom/prometheus:v2.22.0
    container_name: prometheus
    volumes:
      - ./prometheus/:/etc/prometheus/
      - ./data/prometheus:/prometheus
    ports:
      - 9090:9090
    entrypoint:
     - "/bin/prometheus"
     - "--config.file=/etc/prometheus/prometheus.yml"
     - "--storage.tsdb.path=/prometheus"
     - '--storage.tsdb.retention.time=52w'
     - "--web.console.libraries=/etc/prometheus/console_libraries"
     - "--web.console.templates=/etc/prometheus/consoles"
     - "--web.enable-lifecycle"
     - "--web.enable-admin-api"
    restart: always

@timofey-barmin

I experience the same issue (in various environments). A restart of Prometheus seems to help, but it's not clear how big the time jump must be to trigger the problem.
@brian-brazil @roidelapluie Any suggestions on that? I'm asking because I plan to restart Prometheus automatically when the time changes, so I need to know what time difference my script should react to.
Maybe there is another way to detect the problem from a script, as in the sketch below?
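
One idea I'm considering is to watch Prometheus's own /metrics endpoint and restart when the out-of-bounds counter keeps climbing. A rough sketch, assuming the prometheus_target_scrapes_sample_out_of_bounds_total counter is a reliable signal for this state; the address and polling interval are placeholders:

```go
// Rough sketch of one possible detector: poll Prometheus's own /metrics and
// watch prometheus_target_scrapes_sample_out_of_bounds_total. If the counter
// keeps increasing between polls, trigger a restart via whatever supervises
// the process (systemd, Docker, etc.).
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// outOfBoundsTotal scrapes the given /metrics endpoint and sums the values of
// the out-of-bounds counter lines.
func outOfBoundsTotal(metricsURL string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "prometheus_target_scrapes_sample_out_of_bounds_total") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		total += v
	}
	return total, sc.Err()
}

func main() {
	const metricsURL = "http://localhost:9090/metrics"

	prev, _ := outOfBoundsTotal(metricsURL)
	for {
		time.Sleep(5 * time.Minute)
		cur, err := outOfBoundsTotal(metricsURL)
		if err != nil {
			continue // Prometheus may be restarting; try again next cycle
		}
		if cur > prev {
			fmt.Println("out-of-bounds counter is increasing; a restart is probably needed")
		}
		prev = cur
	}
}
```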

@dgl
Member

dgl commented Mar 15, 2021

I'm hopeful the linked PR #8601 will fix this. (Note if you want to test it, you likely need to suspend for >10 minutes).

@timofey-barmin

@dgl Thanks!
I can confirm it helps me on macOS (version 10.15.7). I haven't tested other platforms yet.

@roidelapluie
Member

This will be released in the Prometheus 2.26 series.

A workaround if you run an earlier version is to use the --no-scrape.adjust-timestamps flag.

@roidelapluie
Member

I have changed my mind and released v2.25.2 with this fix. Thanks all!

@tajchert

I'm on 2.26.0 and got that issue on Windows 10 after updating time.

level=info ts=2021-04-30T02:39:14.596Z caller=main.go:767 msg="Server is ready to receive web requests."
level=warn ts=2021-04-30T02:39:25.208Z caller=scrape.go:1402 component="scrape manager" scrape_pool=prometheus target=http://localhost:9182/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=7815
level=warn ts=2021-04-30T02:39:25.208Z caller=scrape.go:1104 component="scrape manager" scrape_pool=prometheus target=http://localhost:9182/metrics msg="Appending scrape report failed" err="out of bounds"

@roidelapluie
Member

If you shift time backwards, this is expected. I don't think there is a reasonable way we can deal with that.

@tajchert

Yes, it was a backward change (a BIOS update kicked the time 4 hours ahead for some reason). Once the clock passed the incorrectly set time, data started coming in normally.

@prometheus prometheus locked as resolved and limited conversation to collaborators Nov 13, 2021