
Sometimes script dies on AssertionError #4

Closed
Dragonn opened this issue Dec 2, 2021 · 5 comments
Labels: bug (Something isn't working)

Comments

Dragonn commented Dec 2, 2021

We have a 5-node cluster spread across 5 racks, using only replicated rules. Sometimes the script dies with the following AssertionError:

-> ./placementoptimizer.py -v balance --max-pg-moves 20 | tee 2021-12-02_balance-upmaps_6
[2021-12-02 16:24:07,886] running pg balancer
[2021-12-02 16:24:07,900] current OSD fill rate per crushclasses:
[2021-12-02 16:24:07,901]   ssd: average=53.83%, median=52.49%, without_placement_constraints=51.36%
[2021-12-02 16:24:07,902] cluster variance for crushclasses:
[2021-12-02 16:24:07,902]   ssd: 12.010
[2021-12-02 16:24:07,902] min osd.22 48.726%
[2021-12-02 16:24:07,902] max osd.11 59.926%
[2021-12-02 16:24:07,910]   SAVE move 6.1f osd.11 => osd.8 (size=8.5G)
[2021-12-02 16:24:07,910]     => variance new=11.354265826467204 < 12.009990670791638=old
[2021-12-02 16:24:07,910]     new min osd.22 48.726%
[2021-12-02 16:24:07,910]         max osd.25 59.629%
[2021-12-02 16:24:07,910]     new cluster variance:
[2021-12-02 16:24:07,910]       ssd: 11.354
[2021-12-02 16:24:07,917]   SAVE move 8.1e osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,917]     => variance new=10.442261655175155 < 11.354265826467204=old
[2021-12-02 16:24:07,917]     new min osd.8 49.928%
[2021-12-02 16:24:07,917]         max osd.4 59.142%
[2021-12-02 16:24:07,917]     new cluster variance:
[2021-12-02 16:24:07,917]       ssd: 10.442
[2021-12-02 16:24:07,924]   SAVE move 8.64 osd.4 => osd.8 (size=12.6G)
[2021-12-02 16:24:07,924]     => variance new=9.682937891685194 < 10.442261655175155=old
[2021-12-02 16:24:07,924]     new min osd.18 49.975%
[2021-12-02 16:24:07,924]         max osd.12 59.027%
[2021-12-02 16:24:07,924]     new cluster variance:
[2021-12-02 16:24:07,925]       ssd: 9.683
[2021-12-02 16:24:07,931]   SAVE move 8.ac osd.12 => osd.21 (size=12.4G)
[2021-12-02 16:24:07,931]     => variance new=8.953738197324748 < 9.682937891685194=old
[2021-12-02 16:24:07,931]     new min osd.18 49.975%
[2021-12-02 16:24:07,932]         max osd.11 58.975%
[2021-12-02 16:24:07,932]     new cluster variance:
[2021-12-02 16:24:07,932]       ssd: 8.954
[2021-12-02 16:24:07,939]   SAVE move 6.11 osd.11 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,939]     => variance new=8.427594885907181 < 8.953738197324748=old
[2021-12-02 16:24:07,939]     new min osd.22 50.116%
[2021-12-02 16:24:07,939]         max osd.25 58.239%
[2021-12-02 16:24:07,939]     new cluster variance:
[2021-12-02 16:24:07,939]       ssd: 8.428
[2021-12-02 16:24:07,946]   SAVE move 8.d9 osd.25 => osd.22 (size=12.4G)
[2021-12-02 16:24:07,946]     => variance new=7.783740064149625 < 8.427594885907181=old
[2021-12-02 16:24:07,946]     new min osd.1 50.187%
[2021-12-02 16:24:07,946]         max osd.11 58.027%
[2021-12-02 16:24:07,946]     new cluster variance:
[2021-12-02 16:24:07,946]       ssd: 7.784
[2021-12-02 16:24:07,953]   SAVE move 6.e osd.11 => osd.1 (size=8.4G)
[2021-12-02 16:24:07,953]     => variance new=7.334701606818667 < 7.783740064149625=old
[2021-12-02 16:24:07,954]     new min osd.9 50.477%
[2021-12-02 16:24:07,954]         max osd.6 57.919%
[2021-12-02 16:24:07,954]     new cluster variance:
[2021-12-02 16:24:07,954]       ssd: 7.335
[2021-12-02 16:24:07,961]   SAVE move 8.f1 osd.6 => osd.9 (size=13.0G)
[2021-12-02 16:24:07,961]     => variance new=6.732430912736088 < 7.334701606818667=old
[2021-12-02 16:24:07,961]     new min osd.18 50.923%
[2021-12-02 16:24:07,961]         max osd.4 57.731%
[2021-12-02 16:24:07,961]     new cluster variance:
[2021-12-02 16:24:07,961]       ssd: 6.732
[2021-12-02 16:24:07,968]   SAVE move 8.1a osd.4 => osd.29 (size=12.5G)
[2021-12-02 16:24:07,969]     => variance new=6.21303665183377 < 6.732430912736088=old
[2021-12-02 16:24:07,969]     new min osd.18 50.923%
[2021-12-02 16:24:07,969]         max osd.12 57.635%
[2021-12-02 16:24:07,969]     new cluster variance:
[2021-12-02 16:24:07,969]       ssd: 6.213
[2021-12-02 16:24:07,975]   SAVE move 8.2c osd.12 => osd.14 (size=12.4G)
[2021-12-02 16:24:07,975]     => variance new=5.7103134469222105 < 6.21303665183377=old
[2021-12-02 16:24:07,976]     new min osd.18 50.923%
[2021-12-02 16:24:07,976]         max osd.16 57.633%
[2021-12-02 16:24:07,976]     new cluster variance:
[2021-12-02 16:24:07,976]       ssd: 5.710
[2021-12-02 16:24:07,982]   SAVE move 6.1f osd.16 => osd.18 (size=8.5G)
[2021-12-02 16:24:07,982]     => variance new=5.332574718339378 < 5.7103134469222105=old
[2021-12-02 16:24:07,982]     new min osd.26 51.099%
[2021-12-02 16:24:07,982]         max osd.11 57.083%
[2021-12-02 16:24:07,982]     new cluster variance:
[2021-12-02 16:24:07,982]       ssd: 5.333
[2021-12-02 16:24:07,988]   SAVE move 6.8 osd.11 => osd.26 (size=8.4G)
[2021-12-02 16:24:07,988]     => variance new=5.004755568105044 < 5.332574718339378=old
[2021-12-02 16:24:07,989]     new min osd.1 51.131%
[2021-12-02 16:24:07,989]         max osd.23 57.009%
[2021-12-02 16:24:07,989]     new cluster variance:
[2021-12-02 16:24:07,989]       ssd: 5.005
Traceback (most recent call last):
  File "./placementoptimizer.py", line 1917, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 984, in prepare_crush_check
    assert reuses == uses
AssertionError

I'll be honest, I haven't really tried to dive into the code and algorithms, but based on the error message I have no idea what I did wrong. Is it even supposed to work on a cluster as small as ours?

By the way, I can avoid it by limiting the number of moves with --max-pg-moves.

I'd be happy to provide any debug info, just tell me what you need.

TheJJ (Owner) commented Dec 6, 2021

Thanks for the report! I'm not sure why this happens, but I guess it's because you have racks (I haven't tested the script in an environment with hierarchy levels above hosts).
Can you run it again with debug logging, i.e. -vv, and send me the output for the whole movement calculation (starting at the last time it prints TRY-0 moving)?
The error basically means that an item (a bucket like a rack or host, or an osd) was not used as often as that layer of the crush hierarchy should be used.
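
To illustrate the idea, here is a minimal sketch of that kind of check. It is not the script's actual code; the function name, the data shapes, and the example bucket ids are made up for illustration:

from collections import Counter

def verify_item_uses(placement_items, allowed_reuses):
    # placement_items: (rule_step, item_id) pairs collected while walking one
    # PG's placement through the crush tree (negative ids are buckets such as
    # the root or racks, non-negative ids would be osds)
    # allowed_reuses: how often an item may be used per rule step, e.g. [3, 1, 1]
    # for a 3-replica rule that takes the root and then chooses racks
    uses_per_item = Counter(placement_items)
    for (step, item), uses in uses_per_item.items():
        reuses = allowed_reuses[step]
        # this is the shape of the check that fails in prepare_crush_check()
        assert reuses == uses, f"item {item} at step {step}: used {uses}x, expected {reuses}x"

# a healthy 3-replica placement: the root is used 3 times, each rack once
verify_item_uses([(0, -1), (0, -1), (0, -1), (1, -17), (1, -18), (1, -19)],
                 [3, 1, 1])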

TheJJ added the bug label on Dec 6, 2021
Dragonn (Author) commented Dec 6, 2021

Here you are, as requested:

[2021-12-06 13:56:18,252] TRY-0 moving pg 8.d9 (1/100) with 13.2G from osd.23
[2021-12-06 13:56:18,252]   OK => taking pg 8.d9 from source osd.23 since it has too many of pool=8 (28 > 25.611671723449557)
[2021-12-06 13:56:18,253] prepare crush check for rule
{'name': 'replicated_rule',
 'steps': [{'item': -1, 'item_name': 'default', 'op': 'take'},
           {'num': 0, 'op': 'chooseleaf_firstn', 'type': 'rack'},
           {'op': 'emit'}]}
[2021-12-06 13:56:18,253] allowed reuses per rule step, starting at root: [3, 1, 1]
[2021-12-06 13:56:18,253] processing crush step {'op': 'take', 'item': -1, 'item_name': 'default'} with tree_depth=0, rule_depth=0
[2021-12-06 13:56:18,253] processing crush step {'op': 'chooseleaf_firstn', 'num': 0, 'type': 'rack'} with tree_depth=1, rule_depth=1
[2021-12-06 13:56:18,253] processing crush step {'op': 'emit'} with tree_depth=4, rule_depth=2
Traceback (most recent call last):
  File "./placementoptimizer.py", line 1917, in <module>
    try_pg_move.prepare_crush_check()
  File "./placementoptimizer.py", line 984, in prepare_crush_check
    assert reuses == uses
AssertionError

Also, I added a debug print(idx, reuses, item, uses) at line 984, and it output the following right before the traceback:

0 3 -1 3
1 1 -18 2

Not sure if it helps
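
For what it's worth, one possible reading of those two lines, assuming item -18 is one of the rack buckets (the rule chooses racks at step 1), would be:

# (idx, reuses, item, uses) as printed by the debug line above
debug_rows = [
    (0, 3, -1, 3),   # step 0: root 'default' (-1) used 3x, 3 allowed -> passes
    (1, 1, -18, 2),  # step 1: bucket -18 used 2x, only 1 allowed -> 'assert reuses == uses' fails
]
for idx, reuses, item, uses in debug_rows:
    status = "ok" if reuses == uses else "assertion fails here"
    print(f"step {idx}: item {item} used {uses}x, expected {reuses}x -> {status}")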

TheJJ (Owner) commented Jan 20, 2022

I think the assertion triggers because the crush map itself fails to validate. In the meantime I've added support for collecting the cluster state, which you can then send me. See #5 :)

Dragonn (Author) commented Jan 20, 2022

Thanks for the info. I'm currently unable to replicate this issue with the latest version of the script. I tried generating more PG moves and it didn't crash either.

Are you okay with me closing this issue and reopening it later if the error reoccurs?

TheJJ (Owner) commented Jan 20, 2022

Sure thing :) Maybe I fixed something in the meantime.

Dragonn closed this as completed on Jan 21, 2022