
Slurm collector: Correctly handle sending records that have already been sent to Auditor #681

Open
QuantumDancer opened this issue Feb 14, 2024 · 4 comments
Labels: bug (Something isn't working)

@QuantumDancer
Collaborator

For some reason, we see lines like this one in our logs:

Feb 14 15:56:13 host auditor-slurm-collector[13759]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"host","pid":13759,"time":"2024-02-14T14:56:13.817031078Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}

As the error message suggests, the record already exists in the database:

# select * from accounting where record_id='slurm-7989000';
    id    |   record_id   |       start_time       |       stop_time        | runtime |          updated_at           
----------+---------------+------------------------+------------------------+---------+-------------------------------
 11737406 | slurm-7989000 | 2024-02-14 11:33:07+01 | 2024-02-14 15:49:37+01 |   15390 | 2024-02-14 15:51:13.805138+01
(1 row)

This happens not only once but for multiple records; we currently have 54 such records in the sending queue.

The issue here is that these records are placed back into the sending queue: for the example record above, a sending attempt has been made 12 times so far:

# journalctl -u auditor-slurm-collector | grep "Failed sending record slurm-7989000" | wc -l
14

In these cases, where the record already exists in the database, we should either stop placing these records back into the sending queue and drop them completely, or place them into something like a dead letter queue so that they can be inspected manually (a rough sketch of this option follows below).

We should also understand why this is happening in the first place.
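A minimal sketch of the dead-letter option, mirroring the RecordExists handling shown further down; move_to_dead_letters is a hypothetical database method that would move the record out of the sending queue into a separate table:

Err(ClientError::RecordExists) => {
    // Hypothetical: instead of re-queueing, park the record for manual inspection.
    tracing::warn!(
        "Record {} already exists in Auditor, moving it to the dead letter queue.",
        id
    );
    database.move_to_dead_letters(id).await?;
}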

Using Auditor and Slurm collector version v0.3.1

QuantumDancer added the bug label on Feb 14, 2024
@QuantumDancer
Collaborator Author

We already handle this case:

Err(ClientError::RecordExists) => {
    tracing::warn!(
        "Failed sending record {} to Auditor instance. Record already exists.",
        id
    );
    database.delete(id).await?;
}

So the question is: why does the delete not work?

@QuantumDancer
Collaborator Author

The retries for the example record disappeared eventually:

$ journalctl -u auditor-slurm-collector | grep "Failed sending record slurm-7989000" | wc -l
26

# first entry
Feb 14 15:56:13 slurm.bfg.uni-freiburg.de auditor-slurm-collector[13759]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"slurm.bfg.uni-freiburg.de","pid":13759,"time":"2024-02-14T14:56:13.817031078Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}
# last entry
Feb 14 17:56:47 slurm.bfg.uni-freiburg.de auditor-slurm-collector[28861]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"slurm.bfg.uni-freiburg.de","pid":28861,"time":"2024-02-14T16:56:47.750486014Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}

Not sure if it's a coincidence that the job was stuck in the queue for almost exactly 2 hours.

@rkleinem
Collaborator

I may have found it.
The problem seems to be in this code:

let (nextcheck, rid) = if records.is_empty() {
    (lastcheck, last_record_id)
} else {
    let local_offset = Local::now().offset().utc_minus_local();
    let (ts, rid) = records.iter().fold(
        (chrono::DateTime::<Utc>::MIN_UTC, String::new()),
        |(acc, _acc_record_id), r| {
            (
                acc.max(r.stop_time.unwrap()),
                r.record_id.as_ref().to_string(),
            )
        },
    );
    (
        DateTime::<Local>::from_naive_utc_and_offset(
            ts.naive_utc(),
            FixedOffset::east_opt(local_offset).unwrap(),
        ),
        rid,
    )
};

Local::now().offset().utc_minus_local() gives a negative number for us, because utc_minus_local is UTC minus local and we are east of UTC.
Later a new offset is constructed from it with FixedOffset::east_opt, but with a negative argument we end up in the western hemisphere!
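A small standalone sketch of the sign confusion, assuming a host at UTC+1 (only chrono's Local and FixedOffset are used):

use chrono::{FixedOffset, Local};

fn main() {
    // On a host at UTC+1, utc_minus_local() yields -3600:
    // it is UTC minus local, not local minus UTC.
    let local_offset = Local::now().offset().utc_minus_local();
    println!("utc_minus_local = {local_offset}");

    // Feeding that negative value into east_opt() builds UTC-01:00
    // instead of UTC+01:00, i.e. an offset west of Greenwich.
    let wrong = FixedOffset::east_opt(local_offset).unwrap();
    let correct = FixedOffset::east_opt(-local_offset).unwrap();
    println!("wrong = {wrong}, correct = {correct}"); // wrong = -01:00, correct = +01:00
}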

Your observation makes sense:

Not sure if it's a coincidence that the job was stuck in the queue for almost exactly 2 hours.

At that time we were at UTC+1. That one hour was not added to UTC but subtracted from it when building lastcheck.
So the records do get deleted properly, but are re-added later because the lastcheck timestamp was 2 hours off. (Right now, at UTC+2, it would be 4 hours off.)
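A sketch of the resulting shift for the example record, assuming a host at UTC+1 and that the local wall-clock reading of lastcheck feeds the next query (the FixedOffset values are hard-coded for illustration):

use chrono::{DateTime, FixedOffset, Local, TimeZone, Utc};

fn main() {
    // Stop time of the example record slurm-7989000, in UTC.
    let ts: DateTime<Utc> = Utc.with_ymd_and_hms(2024, 2, 14, 14, 49, 37).unwrap();

    // Buggy construction: east_opt() fed with utc_minus_local() (-3600 at UTC+1).
    let buggy = DateTime::<Local>::from_naive_utc_and_offset(
        ts.naive_utc(),
        FixedOffset::east_opt(-3600).unwrap(),
    );
    // Intended construction: the actual local offset (+3600 at UTC+1).
    let intended = DateTime::<Local>::from_naive_utc_and_offset(
        ts.naive_utc(),
        FixedOffset::east_opt(3600).unwrap(),
    );

    // Both values denote the same instant, but their local wall-clock
    // readings differ by 2 * offset = 2 hours.
    println!("buggy:    {}", buggy.naive_local());    // 2024-02-14 13:49:37
    println!("intended: {}", intended.naive_local()); // 2024-02-14 15:49:37
}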

This is likely the same problem as #178.

I suggest just doing:

DateTime::<Local>::from_naive_utc_and_offset(
    ts.naive_utc(),
    *Local::now().offset(),
),

I think I will make a PR for this, #811, and #812 (since it's all in the Slurm collector) later this week.

@QuantumDancer
Collaborator Author

QuantumDancer commented May 13, 2024

Nice, thanks a lot for catching that! I'm no expert on timezone handling in Rust, but if you think this should work, please go ahead!
