List possible gotchas #2

Open

sergiocorreia opened this issue Jul 25, 2017 · 6 comments

sergiocorreia commented Jul 25, 2017

It might be good to have a comprehensive list of the things that could go wrong, so users know where to look. For instance:

  • SpookyHash 128 has a low collision chance if the inputs are uniformly distributed, but if the inputs follow patterns that might not be true. For instance, this repo mentions that Spooky128 is "weak (collisions with 4bit diff)". So even if the chance of a collision is low for random inputs, it might not be so low for certain patterns. Moreover, even if the chance of a single user encountering a collision is low, the chance that some user somewhere encounters one might be higher (see the sketch after this list).
  • Are there any overflow risks remaining? I saw some code about detecting whether an int can overflow to double, but numerical software usually has a lot of checks for corner cases (e.g. if there are negative numbers).
  • Are there numerical precision issues? For instance, Mata does sum() computations as quads, and I think other software might do so as well.
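
To make the first point concrete, a collision check on patterned keys could look something like the minimal sketch below. It uses splitmix64 purely as a stand-in mixer (it is not SpookyHash), and the key pattern and count are arbitrary, so it only illustrates the testing idea:

```c
/* Minimal sketch: count hash collisions over patterned (non-random) keys.
 * splitmix64 is only a stand-in mixer here; it is NOT SpookyHash, so this
 * illustrates the testing idea rather than gtools' actual hash. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t splitmix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void) {
    const size_t n = 10000000;           /* 10 million patterned keys */
    uint64_t *h = malloc(n * sizeof *h);
    if (!h) return 1;

    /* Patterned input: consecutive IDs shifted by 4 bits, mimicking the
     * "small bit differences" scenario raised above. */
    for (size_t i = 0; i < n; i++)
        h[i] = splitmix64((uint64_t)i << 4);

    qsort(h, n, sizeof *h, cmp_u64);      /* duplicates become adjacent */

    size_t collisions = 0;
    for (size_t i = 1; i < n; i++)
        collisions += (h[i] == h[i - 1]);

    printf("collisions among %zu keys: %zu\n", n, collisions);
    free(h);
    return 0;
}
```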
mcaceresb self-assigned this Jul 26, 2017
mcaceresb added a commit that referenced this issue Jul 27, 2017
Enhancements

* `gegen varname = group(varlist)` no longer has holes, as noted in issue
  #4
* `gegen` and `gcollapse` fall back on `collapse` and `egen` in case there
  is a collision. Future releases will implement an internal way to resolve
  collisions. This is not a huge concern, as SpookyHash has no known
  vulnerabilities (I believe the concern raised in issue #2
  was based on a typo; see [here](rurban/smhasher#34))
  and the probability of a collision is very low.
* `gegen varname = group(varlist)` now has a consistency test (though
  the group IDs are not the same as `egen`'s, they should map to the `egen`
  group IDs 1 to 1, which is what the tests now check for).

Bug fixes

* Additional fixes for issue #1
* Apparently the arguments Stata passes to plugins have a maximum length. The
  code now makes sure chunks are passed when the PATH length would exceed the
  maximum. The plugin later concatenates the chunks to set the PATH correctly.
@mcaceresb (Owner)

  1. I'm pretty sure that repo's table has a typo (I opened an issue about it here). Nevertheless, for the time being I have implemented a fallback option: `oncollision(fallback|fail)`

The default is fallback, which simply runs the regular collapse if there is a collision. This negates the speed gains whenever a collision is encountered, but collisions should be really rare, so I think this is OK for now. Ultimately I would implement a way to resolve collisions internally in C, but it would take some work (my idea is to create a multi-dimensional array with the keys from the bad hash, assign new hashes, and replace the bad hash with the new hashes; see the first sketch after this list).

  2. Has there been an overflow so far? The prior crashes were both memory crashes. Internally, all computations are done in double precision, and in Stata I take some care to upgrade the variable data type when required.

  3. That is interesting; if sum() is done at quad precision, that would also be the case for means and standard deviations, right? This would add complexity to the code... To save memory, the best thing would be to switch to quad precision only if max() * _N would overflow (max()^2 in the case of sd); see the second sketch below. I'll mull it over.
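
To make the collision-resolution idea in point 1 concrete, here is a minimal sketch under assumed names and types (this is not gtools' code): observations that share a colliding hash are re-grouped by comparing their original keys directly.

```c
/* Sketch of the collision-resolution idea from point 1: for observations
 * that all mapped to the same (colliding) hash, compare the original keys
 * directly and hand out fresh group IDs. Names and types are hypothetical;
 * this is not gtools' code. */
#include <stddef.h>
#include <stdint.h>

/* keys[i] is the original key of observation i within the colliding group;
 * out_id[i] receives a new group ID starting at next_id. Returns the next
 * unused ID. O(n^2), which is fine for the tiny groups a collision yields. */
uint64_t resolve_collision(const int64_t *keys, size_t n,
                           uint64_t *out_id, uint64_t next_id)
{
    for (size_t i = 0; i < n; i++)
        out_id[i] = UINT64_MAX;                 /* sentinel: unassigned */

    for (size_t i = 0; i < n; i++) {
        if (out_id[i] != UINT64_MAX)
            continue;                           /* already grouped      */
        out_id[i] = next_id;
        for (size_t j = i + 1; j < n; j++)      /* same true key =>     */
            if (out_id[j] == UINT64_MAX && keys[j] == keys[i])
                out_id[j] = next_id;            /* same new group       */
        next_id++;
    }
    return next_id;
}
```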
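And here is a minimal sketch of the "upgrade precision only when needed" rule from point 3, with `long double` standing in for quad precision (on many compilers it is 80-bit extended rather than true binary128); the function name and threshold are assumptions, not gtools' implementation:

```c
/* Sketch: accumulate in double unless |max| * N could exceed the double
 * range, in which case upgrade the accumulator. long double stands in for
 * quad precision here. */
#include <float.h>
#include <stddef.h>

double guarded_sum(const double *x, size_t n, double absmax)
{
    if (n == 0)
        return 0.0;

    /* If |max| * N stays inside the double range, a plain double
     * accumulator cannot overflow; otherwise upgrade the accumulator. */
    if (absmax <= DBL_MAX / (double)n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    } else {
        long double s = 0.0L;
        for (size_t i = 0; i < n; i++) s += x[i];
        return (double)s;   /* the final cast may still overflow to inf */
    }
}
```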

@sergiocorreia (Author)

TBH I wouldn't worry that much about collisions right now: I looked for collisions by creating large datasets with unique identifiers and looking for duplicate IDs coming from gegen, and couldn't find anything with integer values (didn't test for strings or multiple variables). So at least in practice it seems that collisions are extremely unlikely.

What would be really useful, as discussed, would be a program that does egen group+tag+count, as it would allow others to speed up their own commands based on it (kind of like how ftools can be used to build fmerge, etc.)

Regarding 2 and 3, my guess is that the best thing would be to create large random test files, collapse them both ways, and check that the numbers match within some epsilon.

Something like this do file perhaps, but with more obs.
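
For the "match within some epsilon" part, a comparison along these lines could work once both collapsed results are exported as plain arrays; the tolerances and the mixed absolute/relative rule below are arbitrary choices, not anything gtools currently ships:

```c
/* Count entries of two result vectors that disagree beyond a tolerance. */
#include <math.h>
#include <stddef.h>

size_t count_mismatches(const double *a, const double *b, size_t n,
                        double rel_tol, double abs_tol)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        double diff  = fabs(a[i] - b[i]);
        double scale = fmax(fabs(a[i]), fabs(b[i]));
        if (diff > fmax(abs_tol, rel_tol * scale))
            bad++;
    }
    return bad;
}

/* e.g. count_mismatches(gcollapse_out, collapse_out, n, 1e-10, 1e-12) */
```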

mcaceresb commented Jul 27, 2017

I agree, but for the sake of completeness I think it's OK to list them (and the concern about SpookyHash was confirmed to be a typo anyway; see this commit).

I already have these checks (see here; I compare gtools to collapse and egen for all the functions). So there are two concerns: whether sums of very small numbers will lose precision (and by extension, means and sds), and whether very large numbers will overflow (ibid). Do you know how to check this? When I try to push Stata to its limits in those directions, I get missing values. I will search empirically when I get a chance, but if you already have something in mind let me know.
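
One way to probe the precision side empirically is to compare a plain double accumulator against Kahan (compensated) summation, which is close to what a quad accumulator would return for this kind of input. A minimal sketch (the term count and magnitude are arbitrary, and this is a probe, not gtools code):

```c
/* Compare a plain double sum with Kahan (compensated) summation to see
 * how much precision a long run of tiny terms loses. */
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    const size_t n = 100000000;       /* 1e8 tiny terms; true sum is 1.0 */
    const double tiny = 1e-8;

    double naive = 0.0;
    double kahan = 0.0, c = 0.0;      /* c carries the lost low-order bits */

    for (size_t i = 0; i < n; i++) {
        naive += tiny;

        double y = tiny - c;          /* Kahan update */
        double t = kahan + y;
        c = (t - kahan) - y;
        kahan = t;
    }

    /* The gap between these two lines is the precision the plain
     * accumulator loses; the compensated sum stays at ~1.0. */
    printf("naive: %.17g\n", naive);
    printf("kahan: %.17g\n", kahan);
    return 0;
}
```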

mcaceresb added a commit that referenced this issue Jul 27, 2017
gtools-0.6.5 through gtools-0.6.9

Enhancements

* Addressed the possible issue noted in issue
  #3 and the functions now
  use mata and extended macro functions as applicable.
* `gegen varname = group(varlist)` no longer has holes, as noted in issue
  #4
* `gegen` and `gcollapse` fall back on `collapse` and `egen` in case there
  is a collision. Future releases will implement an internal way to resolve
  collisions. This is not a huge concern, as SpookyHash has no known
  vulnerabilities (I believe the concern raised in issue #2
  was based on a typo; see [here](rurban/smhasher#34))
  and the probability of a collision is very low.
* `gegen varname = group(varlist)` now has a consistency test (though
  the group IDs are not the same as `egen`'s, they should map to the `egen`
  group IDs 1 to 1, which is what the tests now check for).
* The function now checks numerical variables to see if they are integers.
  Working with integers is faster than hashing.
* The function is now smarter about generating targets. In prior versions,
  when the target statistic was a sum the function would force the target
  type to be `double`. Now if the source already exists and is a float, the
  function checks if the resulting sum would overflow. It will only
  recast the source as double for collapsing if the sum might overflow, that
  is, if `_N * min < -10^38` or `10^38 < _N * max` (note +/- 10^38 are the
  largest/smallest floats Stata can represent; see `help data_types`).

Bug fixes

* `gegen` no longer ignores unavailable options, as noted in issue
  #4, and now it throws an error.
* `gegen varname = tag(varlist)` no longer tags missing values, as noted
  in issue #5
* Additional fixes for issue #1
* Apparently the arguments Stata passes to plugins have a maximum length. The
  code now makes sure chunks are passed when the PATH length would exceed the
  maximum. The plugin later concatenates the chunks to set the PATH correctly.
* Fixed issue #1
* The problem was that the wrapper I wrote to print to the Stata
  console has a maximum buffer size; it encountered an error when
  the new PATH string was longer than the allocated size. Since
  printing this is unnecessary and will only ever be used for
  debugging, I no longer print the PATH.
* Debugging issue #1
  on github (in particular, `env_set` on Windows).
* Removed old debugging code that had been left uncommented
* Improved out-of-memory message (now links to relevant help section).
@mcaceresb (Owner)

Re: Are there any overflow risks remaining?

I have normalized the code base to use 64-bit integers (signed and unsigned as applicable) everywhere. That is, I have moved away from C's int and size_t, which can be as narrow as 16 bits and 32 bits, respectively, on some systems. I have also improved internal checks to make sure the bijection does not overflow, etc.
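
To illustrate the kind of internal check involved, here is a hypothetical sketch (not the actual gtools code) of verifying that the bijection of several integer key variables fits in a 64-bit index before using it, falling back to hashing otherwise:

```c
/* Check that the product of the key variables' ranges fits in uint64_t
 * before bijecting them into a single 64-bit index. Hypothetical sketch. */
#include <stdint.h>
#include <stddef.h>

/* ranges[j] = (max_j - min_j + 1) for key variable j, each assumed >= 1.
 * Returns 1 and sets *total if the combined range fits in 64 bits,
 * 0 if the product would overflow (caller would then hash instead). */
int bijection_fits(const uint64_t *ranges, size_t k, uint64_t *total)
{
    uint64_t prod = 1;
    for (size_t j = 0; j < k; j++) {
        if (prod > UINT64_MAX / ranges[j])
            return 0;                       /* product would overflow */
        prod *= ranges[j];
    }
    *total = prod;
    return 1;
}

/* The bijection itself is then a mixed-radix index, e.g. for two keys:
 * idx = (x0 - min0) + range0 * (x1 - min1)                             */
```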

@sergiocorreia (Author)

That's quite useful. Btw, just out of curiosity: what do you see as the next steps?

@mcaceresb (Owner)

Nothing for now, really. I want to add examples to the documentation I wrote. I hadn't added them to the sthlp files because they're a pain to write, but markdown is fine.

Now that I have an OSX version I'll focus on bug fixes and stability: check corner cases, improve test coverage, minimize memory use, etc. I don't really have any new features planned.
