
Slurm collector: Correctly handle sending records that have already been sent to Auditor #681

Open
QuantumDancer opened this issue Feb 14, 2024 · 4 comments
Labels: bug (Something isn't working)

@QuantumDancer
Collaborator

For some reason, we see lines like this one in our logs:

Feb 14 15:56:13 host auditor-slurm-collector[13759]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"host","pid":13759,"time":"2024-02-14T14:56:13.817031078Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}

As the error message suggests, the record already exists in the database:

# select * from accounting where record_id='slurm-7989000';
    id    |   record_id   |       start_time       |       stop_time        | runtime |          updated_at           
----------+---------------+------------------------+------------------------+---------+-------------------------------
 11737406 | slurm-7989000 | 2024-02-14 11:33:07+01 | 2024-02-14 15:49:37+01 |   15390 | 2024-02-14 15:51:13.805138+01
(1 row)

This happens not only once but for multiple records; we currently have 54 such records in the sending queue.

The issue here is that these records are placed back into the sending queue: for the example record above, a sending attempt has been made 12 times so far:

# journalctl -u auditor-slurm-collector | grep "Failed sending record slurm-7989000" | wc -l
14

In these cases, where the record already exists in the database, we should either stop placing these records back into the sending queue and drop them completely, or place them into something like a dead letter queue so that they can be inspected manually (a rough sketch of this option follows below).

We should also understand why this is happening in the first place.
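A minimal sketch of the dead-letter option, mirroring the RecordExists handling shown further down; move_to_dead_letters is a hypothetical database method that would move the record out of the sending queue into a separate table:

Err(ClientError::RecordExists) => {
    // Hypothetical: instead of re-queueing, park the record for manual inspection.
    tracing::warn!(
        "Record {} already exists in Auditor, moving it to the dead letter queue.",
        id
    );
    database.move_to_dead_letters(id).await?;
}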

Using Auditor and Slurm collector version v0.3.1

QuantumDancer added the bug label on Feb 14, 2024
@QuantumDancer
Collaborator Author

We already handle this case:

Err(ClientError::RecordExists) => {
    tracing::warn!(
        "Failed sending record {} to Auditor instance. Record already exists.",
        id
    );
    database.delete(id).await?;
}

So the question is: why does the delete not work?

@QuantumDancer
Collaborator Author

The retries for the example record disappeared eventually:

$ journalctl -u auditor-slurm-collector | grep "Failed sending record slurm-7989000" | wc -l
26

# first entry
Feb 14 15:56:13 slurm.bfg.uni-freiburg.de auditor-slurm-collector[13759]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"slurm.bfg.uni-freiburg.de","pid":13759,"time":"2024-02-14T14:56:13.817031078Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}
# last entry
Feb 14 17:56:47 slurm.bfg.uni-freiburg.de auditor-slurm-collector[28861]: {"v":0,"name":"AUDITOR-slurm-collector","msg":"Failed sending record slurm-7989000 to Auditor instance. Record already exists.","level":40,"hostname":"slurm.bfg.uni-freiburg.de","pid":28861,"time":"2024-02-14T16:56:47.750486014Z","target":"auditor_slurm_collector::auditorsender","line":168,"file":"collectors/slurm/src/auditorsender.rs"}

Not sure if it's a coincidence that the job was stuck in the queue for almost exactly 2 hours.

@rkleinem
Collaborator

I may have found it.
The problem seems to be in this code:

let (nextcheck, rid) = if records.is_empty() {
    (lastcheck, last_record_id)
} else {
    let local_offset = Local::now().offset().utc_minus_local();
    let (ts, rid) = records.iter().fold(
        (chrono::DateTime::<Utc>::MIN_UTC, String::new()),
        |(acc, _acc_record_id), r| {
            (
                acc.max(r.stop_time.unwrap()),
                r.record_id.as_ref().to_string(),
            )
        },
    );
    (
        DateTime::<Local>::from_naive_utc_and_offset(
            ts.naive_utc(),
            FixedOffset::east_opt(local_offset).unwrap(),
        ),
        rid,
    )
};

Local::now().offset().utc_minus_local() gives a negative number for us, because utc_minus_local is UTC minus local and we are east of UTC.
Later a new offset is constructed from it with FixedOffset::east_opt, but with a negative argument we end up in the western hemisphere!
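A small standalone sketch of the sign confusion, assuming a host at UTC+1 (only chrono's Local and FixedOffset are used):

use chrono::{FixedOffset, Local};

fn main() {
    // On a host at UTC+1, utc_minus_local() yields -3600:
    // it is UTC minus local, not local minus UTC.
    let local_offset = Local::now().offset().utc_minus_local();
    println!("utc_minus_local = {local_offset}");

    // Feeding that negative value into east_opt() builds UTC-01:00
    // instead of UTC+01:00, i.e. an offset west of Greenwich.
    let wrong = FixedOffset::east_opt(local_offset).unwrap();
    let correct = FixedOffset::east_opt(-local_offset).unwrap();
    println!("wrong = {wrong}, correct = {correct}"); // wrong = -01:00, correct = +01:00
}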

Your observation makes sense:

Not sure if it's a coincidence that the job was stuck in the queue for almost exactly 2 hours.

At that time we were at UTC+1. That one hour was not added to UTC but subtracted from it when building lastcheck.
So the records do get deleted properly, but are re-added later because the lastcheck timestamp was 2 hours off. (Right now, at UTC+2, it would be 4 hours off.)
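A sketch of the resulting shift for the example record, assuming a host at UTC+1 and that the local wall-clock reading of lastcheck feeds the next query (the FixedOffset values are hard-coded for illustration):

use chrono::{DateTime, FixedOffset, Local, TimeZone, Utc};

fn main() {
    // Stop time of the example record slurm-7989000, in UTC.
    let ts: DateTime<Utc> = Utc.with_ymd_and_hms(2024, 2, 14, 14, 49, 37).unwrap();

    // Buggy construction: east_opt() fed with utc_minus_local() (-3600 at UTC+1).
    let buggy = DateTime::<Local>::from_naive_utc_and_offset(
        ts.naive_utc(),
        FixedOffset::east_opt(-3600).unwrap(),
    );
    // Intended construction: the actual local offset (+3600 at UTC+1).
    let intended = DateTime::<Local>::from_naive_utc_and_offset(
        ts.naive_utc(),
        FixedOffset::east_opt(3600).unwrap(),
    );

    // Both values denote the same instant, but their local wall-clock
    // readings differ by 2 * offset = 2 hours.
    println!("buggy:    {}", buggy.naive_local());    // 2024-02-14 13:49:37
    println!("intended: {}", intended.naive_local()); // 2024-02-14 15:49:37
}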

This is likely the same problem as #178.

I suggest just doing:

DateTime::<Local>::from_naive_utc_and_offset(
    ts.naive_utc(),
    *Local::now().offset(),
),

I think I will make a PR for this, #811, and #812 (since it's all in the Slurm collector) later this week.

@QuantumDancer
Collaborator Author

QuantumDancer commented May 13, 2024

Nice, thanks a lot for catching that! I'm no expert on timezone handling in Rust, but if you think this should work, please go ahead!
