Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C solution 2: Don't over-allocate #406

Merged
merged 3 commits into from
Jul 20, 2021

Conversation

tjol
Copy link
Contributor

@tjol tjol commented Jul 13, 2021

Description

C solution 2 was allocating 8x the space required.

Contributing requirements

  • I read the contribution guidelines in CONTRIBUTING.md.
  • I placed my solution in the correct solution folder.
  • I added a README.md with the right badge(s).
  • I added a Dockerfile that builds and runs my solution.
  • I selected drag-race as the target branch.
  • All code herein is licensed compatible with BSD-3.

@rbergen
Copy link
Contributor

rbergen commented Jul 13, 2021

Tagging @danielspaangberg and allowing time so they can review and respond to this PR on their original solution, if they want to do so.

@danielspaangberg
Copy link
Contributor

Thanks @tjol for noticing and @rbergen for notifying me. You're right, I allocate 8 times as much as needed, but I think your solution needs to be changed slightly. For example for 1'000'000 integers and a 64 bit storage your code would allocate 7812 uint64_t, but they only hold 499968 bits. So I think your solution is good, but there should be (maxints+16*sizeof(TYPE)-1)/16/sizeof(TYPE).

@tjol
Copy link
Contributor Author

tjol commented Jul 13, 2021

You're right, of course.

@danielspaangberg
Copy link
Contributor

@tjol That looks good to me! @rbergen - is any action required by me at this point?

@rbergen
Copy link
Contributor

rbergen commented Jul 13, 2021

@danielspaangberg No, you've done an excellent job by responding so quickly and being so eagle-eyed. Thank you!

@rbergen
Copy link
Contributor

rbergen commented Jul 13, 2021

When I run the latest commit locally in docker, this is what I get:

danielspaangberg_1of2;6728;5.000034;1;algorithm=base,faithful=yes,bits=1
danielspaangberg_8of30;8816;5.000249;1;algorithm=wheel,faithful=yes,bits=1
Segmentation fault
danielspaangberg_48of210;9936;5.000007;1;algorithm=wheel,faithful=yes,bits=1
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
danielspaangberg_1of2_epar;13078;5.007350;8;algorithm=base,faithful=yes,bits=1
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault

@tjol
Copy link
Contributor Author

tjol commented Jul 13, 2021

Well that's interesting...

I didn't get any segmentation faults running locally, but I do in docker. Curious.

@danielspaangberg
Copy link
Contributor

Ouch. Will look into what is wrong. I got a segfault running locally when linking with electric fence.

@danielspaangberg
Copy link
Contributor

danielspaangberg commented Jul 13, 2021

I have a bug with a loop test terminating a bunch of loops with
for (unsigned int i=(factor*factor)>>1U; i<=maxintsh; )
which should be
for (unsigned int i=(factor*factor)>>1U; i<maxintsh; )
and with 32 bit ints it actually allocates exactly the number of integers required, so going off by one here triggers a segfault.
In some of the codes this was fixed already, why the segmentation fault does not always happen.
Problem - I am on vacation and cannot fix this with the computer I am currently at (not skilled enough with the git command line unfortunately). I can update the code when I get back home in a week.

@rbergen
Copy link
Contributor

rbergen commented Jul 13, 2021

@danielspaangberg Again, impressively eagle-eyed! My advice would be to enjoy your vacation while it lasts. If this ends up taking a week to get fixed, that's just life.

@tjol
Copy link
Contributor Author

tjol commented Jul 13, 2021

@danielspaangberg I've come to the same conclusion – it looks like that's where the problem is. For *_epar and *_only_write_read_bits, changing the <= to a < fixes the problem. However, for *_par, this ends up giving an invalid result. This appears to be due to the way you're splitting up the work between threads, which I do not fully understand.

For the *_par solutions, I get the correct result by leaving the <= intact and replacing

 if (ithread==numthreads-1)
      mylastbit=lastbit;

with

 if (ithread==numthreads-1)
      mylastbit=lastbit-1;

While this works, I do not expect that it's an ideal solution. If you agree, I'm happy to commit it – otherwise I'll wait for you to do the job properly in a week or two.

Enjoy the rest of your holiday!

@tjol tjol marked this pull request as draft July 13, 2021 20:52
@danielspaangberg
Copy link
Contributor

@danielspaangberg I've come to the same conclusion – it looks like that's where the problem is. For *_epar and *_only_write_read_bits, changing the < to a <= fixes the problem. However, for *_par, this ends up giving an invalid result. This appears to be due to the way you're splitting up the work between threads, which I do not fully understand.

For the *_par solutions, I get the correct result by leaving the <= intact and replacing

 if (ithread==numthreads-1)
      mylastbit=lastbit;

with

 if (ithread==numthreads-1)
      mylastbit=lastbit-1;

While this works, I do not expect that it's an ideal solution. If you agree, I'm happy to commit it – otherwise I'll wait for you to do the job properly in a week or two.

Enjoy the rest of your holiday!

Great and thanks! It all looks OK to me, but since it is hard for me to test right now, I can do a proper check next week. The tricky thing with the parallel version masking of the bits, is making sure no two threads touch the same integer. And it is hard to test, since the results will only get garbled sometimes. So I'll read through that part of the code again later.

@fvbakel
Copy link
Contributor

fvbakel commented Jul 19, 2021

I took the freedom to learn from this PR and the original code of sieve_1of2.c and build my own idea on top of that. Doing so I found a faster way to allocate the memory than what is described above. See my comments and code in PR #422. Next to that I found a bug in sieve_1of2.c, if you set the maxints to 97, 97 itself is not counted and the number of reported primes is off by one. I solved that to in the solution of PR #422. Maybe you can find a good way to solve that bug in this PR too.

@danielspaangberg
Copy link
Contributor

I took the freedom to learn from this PR and the original code of sieve_1of2.c and build my own idea on top of that. Doing so I found a faster way to allocate the memory than what is described above. See my comments and code in PR #422. Next to that I found a bug in sieve_1of2.c, if you set the maxints to 97, 97 itself is not counted and the number of reported primes is off by one. I solved that to in the solution of PR #422. Maybe you can find a good way to solve that bug in this PR too.

Thanks @fvbakel ! I will look at this as soon as I have checked the other bugs. I looked at your code, but cannot understand how that allocates memory faster? The code is simpler for sure, but the performance should not be that different?

@fvbakel
Copy link
Contributor

fvbakel commented Jul 20, 2021

@danielspaangberg I was surprised by the difference in performance too. I have to check if I can really measure a difference when I run multiple times. I found that on my under powered hardware there can sometimes be a bit of variation in performance.

@danielspaangberg
Copy link
Contributor

@danielspaangberg I've come to the same conclusion – it looks like that's where the problem is. For *_epar and *_only_write_read_bits, changing the < to a <= fixes the problem. However, for *_par, this ends up giving an invalid result. This appears to be due to the way you're splitting up the work between threads, which I do not fully understand.
For the *_par solutions, I get the correct result by leaving the <= intact and replacing

 if (ithread==numthreads-1)
      mylastbit=lastbit;

with

 if (ithread==numthreads-1)
      mylastbit=lastbit-1;

While this works, I do not expect that it's an ideal solution. If you agree, I'm happy to commit it – otherwise I'll wait for you to do the job properly in a week or two.
Enjoy the rest of your holiday!

Great and thanks! It all looks OK to me, but since it is hard for me to test right now, I can do a proper check next week. The tricky thing with the parallel version masking of the bits, is making sure no two threads touch the same integer. And it is hard to test, since the results will only get garbled sometimes. So I'll read through that part of the code again later.

So I have read through the code and tested a bit - I can see no problems with your changes @tjol - they solve the issues for the parallel code as well. Thanks for all the changes and fixes!!

@danielspaangberg
Copy link
Contributor

@danielspaangberg I was surprised by the difference in performance too. I have to check if I can really measure a difference when I run multiple times. I found that on my under powered hardware there can sometimes be a bit of variation in performance.

@fvbakel - I checked the bug with the odd integers as you mention. I can add a simple fix for this in this solution as well (basically just maxints+=maxints%2 in create sieve will solve it), and will do soon. Thanks for noticing. I also looked at your segmented solution, which is clever. I did some experiments with a segmented sieve, but could not make it run faster for only one million ints. I have looked more at the code around the calloc, and do not understand how it can be much different in speed, if I am not missing anything essential, it is just the code to compute how many TYPE should be allocated, which should now in the code by @tjol just compile to an add and and a shift? I just checked the compiler output and that's basically what's there (a couple of moves too).

@fvbakel
Copy link
Contributor

fvbakel commented Jul 20, 2021

@danielspaangberg Thanks for the feedback. I think the difference in performance of the memory allocation can be considered a glitch in my testing. It is indeed just the calculation of the amount. On Alpine, the segmented solution is slower because alpine has a bad implementation of memcpy.

@danielspaangberg
Copy link
Contributor

@danielspaangberg Thanks for the feedback. I think the difference in performance of the memory allocation can be considered a glitch in my testing. It is indeed just the calculation of the amount. On Alpine, the segmented solution is slower because alpine has a bad implementation of memcpy.

Nice! Good to know - I was just staring at the code for a while. I tested your code on debian under wsl2 and your code ran almost twice as fast as the 1of2. I think I tried my segmented code to be too generic. It is clever to do the first pass just for the first word.

@fvbakel
Copy link
Contributor

fvbakel commented Jul 20, 2021

@danielspaangberg Thank you, they only remaining huddle is to get that PR merged.

@tjol
Copy link
Contributor Author

tjol commented Jul 20, 2021

Do I understand correctly that we're all happy for this PR to be reviewed and merged in its current form? (In additional to @fvbakel’s new and improved version)

@tjol tjol marked this pull request as ready for review July 20, 2021 15:23
@danielspaangberg
Copy link
Contributor

Do I understand correctly that we're all happy for this PR to be reviewed and merged in its current form? (In additional to @fvbakel’s new and improved version)

This PR (your, @tjol, changes) at least I am completely happy with!! The additions @fvbakel has done regarding the segmented version, is in a separate PR, right? The changes suggested by @fvbakel for the arguments of calloc, I do not see as necessary in this PR and no further changes are required. But, there might be additional changes needed, depending on how we see what the maxint argument means. If we give an odd integer as maxint, and that integer corresponds to a prime, should that be counted? For instance 97 is prime, but is currently not counted by this implementation. Do we count primes below maxint or below and including maxint?

@rbergen
Copy link
Contributor

rbergen commented Jul 20, 2021

Do we count primes below maxint or below and including maxint?

@danielspaangberg, @tjol (and others), what you decide concerning this PR is entirely up to you. I just want to note that we have not "standardized" this from a maintainer's perspective. An important reason for that is that we focus on a sieve size of an even number, which means either approach works.

For now, I'll just wait for a comment that says that I can indeed review this PR towards merging it.

@danielspaangberg
Copy link
Contributor

@rbergen - @tjol marked it as ready for review and I am also happy with this. I have tested with electric fence outside of docker, and with docker as well. So it should be fine for reviewing and merging.

@rbergen rbergen merged commit f97cf59 into PlummersSoftwareLLC:drag-race Jul 20, 2021
@danielspaangberg
Copy link
Contributor

Thanks @rbergen - really appreciate good management!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants