@progier389
Contributor

@progier389 progier389 commented Jun 17, 2024

Several issues related to backup task error handling:
- Backends stay busy after the failure
- Exit code is 0 in some cases
- Crash if failing to open the backup directory

And a more general one:
- lib389 Task DN collision

Solutions:
- Always reset the busy flags that have been set
- Ensure that 0 is not returned in error case
- Avoid closing NULL directory descriptor
- Use a timestamp having milliseconds precision to create the task DN
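
The first two solutions follow a common cleanup pattern, sketched here in Python for illustration only. The actual fix is in the server's C code and is not reproduced here; `Backend`, `run_backup`, and `do_backup` are hypothetical names introduced for this sketch.

```python
class Backend:
    """Hypothetical stand-in for a server backend carrying a busy flag."""

    def __init__(self, name):
        self.name = name
        self.busy = False


def run_backup(backends, do_backup):
    """Mark backends busy, run do_backup, and always clear the flags we set.

    Returns 0 on success and -1 on any failure (never 0 after an error).
    """
    marked = []  # only the backends whose flag this call actually set
    rc = -1      # pessimistic default: an early exit cannot report success
    try:
        for be in backends:
            if be.busy:
                return -1  # owned by another task; finally still runs
            be.busy = True
            marked.append(be)
        do_backup(backends)
        rc = 0
    except OSError:
        rc = -1
    finally:
        # Reset exactly the flags set above, whatever the outcome.
        for be in marked:
            be.busy = False
    return rc
```

Tracking the flags in `marked` means a failure halfway through leaves no backend stuck busy, while a backend that was already busy for another task is left untouched.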

Issue: #6229

Reviewed by: @droideck (Thanks!)

@progier389 progier389 force-pushed the i3967 branch 2 times, most recently from 936284b to 9875301 Compare June 18, 2024 11:20
@progier389
Contributor Author

progier389 commented Jun 18, 2024

Increased the precision of the timestamp used to generate task CN: With one second precision the CI test is randomly failing because of task DN collision.

@progier389
Contributor Author

progier389 commented Jun 20, 2024

Looks like there is another race condition:
The second backup task still sometimes fails:

           exitCode = tasks.db2bak(backup_dir=archive_dir2, args={TASK_WAIT: True})
>           assert exitCode == 0
E           assert -1 == 0

@progier389
Contributor Author

Looks like I did not fix the right place: it is the same task name conflict issue (and I do not see subsecond precision in the task CN):
[18/Jun/2024:11:33:06.073635310 +0000] conn=1 op=6 ADD dn="cn=backup_06182024_113306,cn=backup,cn=tasks,cn=config"
[18/Jun/2024:11:33:06.078141131 +0000] conn=1 op=6 RESULT err=0 tag=105 nentries=0 wtime=0.000214625 optime=0.004512012 etime=0.004725204
...
[18/Jun/2024:11:33:06.282371662 +0000] conn=1 op=10 ADD dn="cn=backup_06182024_113306,cn=backup,cn=tasks,cn=config"
[18/Jun/2024:11:33:06.283060970 +0000] conn=1 op=10 RESULT err=68 tag=105 nentries=0 wtime=0.000159735 optime=0.000696302 etime=0.000854243
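
In the log above both ADD operations use the same CN and the second one gets err=68 (LDAP entryAlreadyExists). A minimal sketch of why second precision collides and millisecond precision does not, using the two timestamps from the log. The `task_cn` helper and the `_mmm` suffix layout are assumptions made for illustration, not the format lib389 actually uses.

```python
from datetime import datetime


def task_cn(prefix, now, subsecond=True):
    """Build a task CN from a timestamp (hypothetical helper).

    The base layout mirrors the CN seen in the log (MMDDYYYY_HHMMSS);
    the millisecond suffix is an assumed format.
    """
    stamp = now.strftime("%m%d%Y_%H%M%S")
    if subsecond:
        stamp += "_%03d" % (now.microsecond // 1000)
    return f"{prefix}_{stamp}"


# The two ADD operations from the access log, about 200 ms apart.
first = datetime(2024, 6, 18, 11, 33, 6, 73635)
second = datetime(2024, 6, 18, 11, 33, 6, 282371)

# Second precision: both tasks get the same CN, so the second ADD fails.
same = task_cn("backup", first, subsecond=False) == task_cn("backup", second, subsecond=False)

# Millisecond precision: the CNs differ.
distinct = task_cn("backup", first) != task_cn("backup", second)
```

Millisecond precision only narrows the collision window; two tasks created within the same millisecond would still collide.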

@progier389 progier389 force-pushed the i3967 branch 3 times, most recently from 5f93d75 to 8739d37 Compare June 20, 2024 14:16
@progier389
Contributor Author

The test is still failing. Now I think it is related to private tmp:

[20/Jun/2024:13:17:36.826947195 +0000] - ERR - ldbm_back_ldbm2archive - mkdir(/tmp/tmpoyb8i9aj/bak2) failed; errno 2 (Unexpected dbimpl error code)
[20/Jun/2024:13:17:36.827410826 +0000] - ERR - ldbm_back_ldbm2archive - Failed removing /tmp/tmpoyb8i9aj/bak2
[20/Jun/2024:13:17:36.828107407 +0000] - ERR - task_backup_thread - Backup failed (error -1)

Will change the temporary directory.
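
For reference, errno 2 is ENOENT ("No such file or directory"), which `mkdir` reports when a parent component of the path does not exist from the point of view of the calling process. That would be consistent with the private tmp suspicion, though the sketch below only shows the errno behaviour and does not involve the server. `try_mkdir` and the directory names are hypothetical.

```python
import errno
import os
import tempfile


def try_mkdir(path):
    """Return 0 on success, or the errno reported by mkdir (hypothetical helper)."""
    try:
        os.mkdir(path)
        return 0
    except OSError as exc:
        return exc.errno


with tempfile.TemporaryDirectory() as base:
    # The parent "does-not-exist" is never created, so mkdir cannot succeed.
    missing_parent = os.path.join(base, "does-not-exist", "bak2")
    code = try_mkdir(missing_parent)

# Symbolic name for the numeric error code.
name = errno.errorcode[code]
```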

@progier389 progier389 force-pushed the i3967 branch 2 times, most recently from 4da66d8 to 0ccd43b Compare June 21, 2024 17:40
@progier389
Contributor Author

Have to retry the first backup in a loop until it fails (sometimes it does not).
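
A rough sketch of such a retry loop, assuming a callable that returns the task exit code. `retry_until_failure`, `run_backup`, and `max_attempts` are hypothetical names; this is not the code of the actual test.

```python
def retry_until_failure(run_backup, max_attempts=10):
    """Call run_backup() until it returns non-zero.

    Returns (attempt_number, exit_code) for the first failing attempt and
    raises AssertionError if every attempt succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        rc = run_backup()
        if rc != 0:
            return attempt, rc
    raise AssertionError(
        f"backup did not fail after {max_attempts} attempts"
    )
```

Bounding the loop with `max_attempts` keeps the test from hanging if the expected failure never occurs.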

Member

@droideck droideck left a comment

LGTM! Concerns are minor and ignorable if you'd like.

@progier389
Contributor Author

Fixed @droideck's remarks.

@progier389 progier389 self-assigned this Jun 28, 2024
@progier389 progier389 merged commit 04a0b6a into 389ds:main Jun 28, 2024
progier389 added a commit that referenced this pull request Feb 6, 2025
…#6230)

* Issue 6229 - After an initial failure, subsequent online backups will not work

Several issues related to backup task error handling:
Backends stay busy after the failure
Exit code is 0 in some cases
Crash if failing to open the backup directory
And a more general one:
lib389 Task DN collision

Solutions:
Always reset the busy flags that have been set
Ensure that 0 is not returned in error case
Avoid closing NULL directory descriptor
Use a timestamp having milliseconds precision to create the task DN

Issue: #6229

Reviewed by: @droideck (Thanks!)

(cherry picked from commit 04a0b6a)
progier389 added a commit that referenced this pull request Feb 6, 2025
progier389 added a commit that referenced this pull request Feb 6, 2025
progier389 added a commit that referenced this pull request Feb 6, 2025
@progier389 progier389 deleted the i3967 branch May 20, 2025 13:04