Reconstruction get stuck with minimal missing samples #29

Open
leobago opened this issue Mar 27, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@leobago
Collaborator

leobago commented Mar 27, 2023

When executed with the following parameters:

randomSeed = "DAS"

and

INFO : Simulator : Shape: {'run': 0, 'numberNodes': 256, 'blockSize': 32, 'failureRate': 10, 'netDegree': 8, 'class1ratio': 0.8, 'chi': 2, 'vpn1': 1, 'vpn2': 500, 'bwUplinkProd': 2200, 'bwUplink1': 110, 'bwUplink2': 2200, 'randomSeed': 'DAS-bs-32-nn-256-fr-10-c1r-0.8-chi-2-vpn1-1-vpn2-500-bwupprod-2200-bwup1-110-bwup2-2200-nd-8-r-0'} ... Block Available: 0 in 69 steps

the block does not become available.
The weird thing is that it gets blocked when only 140 samples are missing. Here are the last lines of the XML dump:

	<item type="int">316</item>
	<item type="int">279</item>
	<item type="int">210</item>
	<item type="int">157</item>
	<item type="int">140</item>
	<item type="int">140</item>

For a block with 1024 samples and a network of 256 nodes, I think it is virtually impossible to get stuck at this stage. If the 140 missing samples are distributed among multiple nodes, then any node with fewer than 16 missing samples should be able to reconstruct them. If all of the missing samples are in a single node, then the other nodes holding the same rows/columns should have the entire data.

@cskiraly cskiraly added the bug Something isn't working label Mar 27, 2023
@cskiraly
Contributor

I was already debugging something similar. The best way for me was to

  • enable TRACE logging
  • check which node misses what segment at the end
  • trace back where it should have received from

Can you add:

@leobago
Collaborator Author

leobago commented Mar 28, 2023

Just the current develop branch (commit bcf3098).
The bug happens with almost any configuration.
If you want to reproduce the exact same case I posted, please see the shape info given in the issue.
If you want to make it happen quickly, I recommend the following config:

dumpXML = 1
visualization = 1
logLevel = logging.INFO
numJobs = -1
evenLineDistribution = True
runs = range(2)
numberNodes = range(256, 313, 128)
failureRates = range(10, 31, 40)
blockSizes = range(32,35,16)
netDegrees = range(8, 9, 2)
chis = range(2, 3, 2)
class1ratios = [0.8]
validatorsPerNode1 = [1]
validatorsPerNode2 = [500]
bwUplinksProd = [2200]
bwUplinks1 = [110]
bwUplinks2 = [2200]
deterministic = True
randomSeed = "DAS"
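For reference, here is a minimal sketch of what the sweep ranges in this config expand to. How the simulator actually iterates over them is an assumption here; the `shapes` list and the use of `itertools.product` are illustrative only. The point is that each range yields a single value, so the config boils down to two runs of the same shape.

```python
from itertools import product

# Sweep ranges copied from the config above.
runs = range(2)
numberNodes = range(256, 313, 128)
failureRates = range(10, 31, 40)
blockSizes = range(32, 35, 16)
netDegrees = range(8, 9, 2)
chis = range(2, 3, 2)

# Hypothetical expansion (not the simulator's own code):
# one tuple per combination of swept values.
shapes = list(product(runs, numberNodes, failureRates,
                      blockSizes, netDegrees, chis))

for shape in shapes:
    print(shape)
# (0, 256, 10, 32, 8, 2)
# (1, 256, 10, 32, 8, 2)
```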

@leobago
Collaborator Author

leobago commented Mar 28, 2023

I just discovered that these are the nodes with missing samples at the end. Other nodes should have that data so all of them should be able to reconstruct those samples.

WARNING : Simulator : Node 36 is missing 17 samples
WARNING : Simulator : Node 59 is missing 19 samples
WARNING : Simulator : Node 78 is missing 17 samples
WARNING : Simulator : Node 127 is missing 17 samples
WARNING : Simulator : Node 157 is missing 17 samples
WARNING : Simulator : Node 161 is missing 17 samples
WARNING : Simulator : Node 193 is missing 18 samples
WARNING : Simulator : Node 198 is missing 18 samples
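As a quick sanity check, the per-node counts above can be summed to confirm they account for the 140 missing samples in the XML dump. This is just a small sketch that parses the warning lines as pasted; the regex and variable names are illustrative, not part of the simulator.

```python
import re

# Warning lines copied from the log above.
log = """\
WARNING : Simulator : Node 36 is missing 17 samples
WARNING : Simulator : Node 59 is missing 19 samples
WARNING : Simulator : Node 78 is missing 17 samples
WARNING : Simulator : Node 127 is missing 17 samples
WARNING : Simulator : Node 157 is missing 17 samples
WARNING : Simulator : Node 161 is missing 17 samples
WARNING : Simulator : Node 193 is missing 18 samples
WARNING : Simulator : Node 198 is missing 18 samples
"""

pattern = re.compile(r"Node (\d+) is missing (\d+) samples")
missing = {int(node): int(count) for node, count in pattern.findall(log)}

total_missing = sum(missing.values())
print(len(missing), "nodes,", total_missing, "samples missing")
# 8 nodes, 140 samples missing
```

So all 140 missing samples sit in these eight nodes, each of which is missing between 17 and 19.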

@leobago
Collaborator Author

leobago commented Mar 28, 2023

Same case, focusing on Node 36 and its row 30 neighbors:

WARNING : Simulator : Node 36 is missing 17 samples
Row 30: bitarray('00101110101110010010110000010110')
Row 30, Neighbor 19 sent: bitarray('00000100000000000000000000010010')
Row 30, Neighbor 19 has: bitarray('11111111111111111111111111111111')
Row 30, Neighbor 8 sent: bitarray('00000010000000000000000000000100')
Row 30, Neighbor 8 has: bitarray('11111111111111111111111111111111')
Row 30, Neighbor 1 sent: bitarray('00000000100100000000000000000000')
Row 30, Neighbor 1 has: bitarray('11111111111111111111111111111111')
Row 30, Neighbor 6 sent: bitarray('00000000000010000000010000000000')
Row 30, Neighbor 6 has: bitarray('11111111111111111111111111111111')
Row 30, Neighbor 23 sent: bitarray('00001000000000000000000000000000')
Row 30, Neighbor 23 has: bitarray('11111111111111111111111111111111')
Row 30, Neighbor 16 sent: bitarray('00100000000000010010000000000000')
Row 30, Neighbor 16 has: bitarray('11111111111111111111111111111111')

It seems clear the neighbors should have sent all the data they have, but for some reason they are not doing so.
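To make the numbers concrete, here is a small sketch that tallies the bitmaps above using plain strings instead of `bitarray`. It assumes a `1` means "sample held" in the row bitmap and "sample sent to Node 36" in the neighbor bitmaps, and it assumes a row can be reconstructed once half of its 32 samples are present; both are my reading of the output, not something taken from the simulator code.

```python
# Bitmaps copied from the log above.
ROW_30 = "00101110101110010010110000010110"
SENT = {
    19: "00000100000000000000000000010010",
    8:  "00000010000000000000000000000100",
    1:  "00000000100100000000000000000000",
    6:  "00000000000010000000010000000000",
    23: "00001000000000000000000000000000",
    16: "00100000000000010010000000000000",
}

def ones(bits):
    """Return the set of positions holding a '1'."""
    return {i for i, b in enumerate(bits) if b == "1"}

have = ones(ROW_30)
sent_union = set().union(*(ones(bits) for bits in SENT.values()))

received = len(have)               # samples Node 36 holds in row 30
missing = len(ROW_30) - received   # samples still missing in row 30

# Assumption: half of the row is enough to reconstruct the rest.
threshold = len(ROW_30) // 2
can_reconstruct = received >= threshold

# Positions Node 36 holds that none of the listed neighbors sent.
from_elsewhere = sorted(have - sent_union)

print(received, missing, can_reconstruct, len(sent_union), from_elsewhere)
# 15 17 False 13 [10, 20]
```

Under these assumptions Node 36 holds 15 of 32 samples in row 30, one short of the threshold, and all 17 of its missing samples are in this row. The six neighbors sent only 13 distinct samples between them even though each holds the full row.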

@cskiraly cskiraly mentioned this issue Mar 29, 2023