Conversation

@1RyanK 1RyanK commented Oct 6, 2025

I was mainly trying to fix issues surrounding things like `ak.array([-2**200])`, but while working in `ak.array` I noticed a few other things that needed fixing.

This is needed for #4593.

Summary of Changes

  • Added an early `str` check to raise a clear `TypeError` for scalar string inputs.
  • Tightened iterable handling to convert only generators/ranges to lists, leaving `np.ndarray` and `pd.Series` intact.
  • Refined unsigned integer inference to require non-negative values and to include values ≥ 2⁶³.
  • Added automatic bigint inference for object arrays containing only integers.
  • Reworked negative bigint handling to properly construct signed bigints using a sign mask (see the sketch after this list).
  • Added tests for large negative bigint values, range conversion, unsigned inference, and mixed-sign behavior.
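
As a rough illustration of the sign-mask idea (a minimal sketch with a hypothetical `to_limbs` helper, not the exact code in this PR): masking a negative Python int to a fixed width yields its two's-complement bit pattern, which can then be split into `uint64` limbs.

```python
def to_limbs(value: int, n_limbs: int) -> list[int]:
    # Hypothetical helper: encode a (possibly negative) Python int as
    # two's-complement uint64 limbs, most significant limb first.
    width = 64 * n_limbs
    raw = value & ((1 << width) - 1)  # the "sign mask" fixes the bit width
    return [(raw >> (64 * i)) & 0xFFFFFFFFFFFFFFFF
            for i in range(n_limbs - 1, -1, -1)]

assert to_limbs(-(2**200), 4)[0] == 0xFFFFFFFFFFFFFF00  # top limb carries the sign
```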

Purpose:
Fixes incorrect bigint conversion for large negative numbers and cleans up input normalization and dtype inference logic.

Closes #4984: ak.array with negative numbers still has problems

@1RyanK 1RyanK force-pushed the 4984-ak.array_with_negative_numbers_still_has_problems branch 7 times, most recently from 8b1ba0f to 3f7eb0f on October 9, 2025 12:37
@1RyanK 1RyanK marked this pull request as ready for review October 9, 2025 13:20
@1RyanK 1RyanK added the blocking label Oct 9, 2025

@ajpotts ajpotts left a comment

I think this is good for now. We need to refactor this `ak.array` function to clean up the logic, but I created a separate issue for that: #4990

@1RyanK 1RyanK force-pushed the 4984-ak.array_with_negative_numbers_still_has_problems branch 2 times, most recently from c65fa9e to 1ce06f9 on October 14, 2025 15:19
@1RyanK 1RyanK force-pushed the 4984-ak.array_with_negative_numbers_still_has_problems branch from 1ce06f9 to a783f45 on October 15, 2025 23:55

@jaketrookman jaketrookman left a comment

Looks good

@drculhane drculhane left a comment

I ran the unit tests, and also the one specific example you cited when you created the issue. Looks good. I concur that `ak.array` has become a bit of a mess, and I'm glad to see that we now have an issue for that, too.

@ajpotts ajpotts added this pull request to the merge queue Oct 17, 2025
Merged via the queue into Bears-R-Us:main with commit 353ca4e Oct 17, 2025
21 checks passed
github-merge-queue bot pushed a commit that referenced this pull request Dec 4, 2025
…es (#5044)

While PR #4985 technically fixed the general negative bigint problems, it caused a performance regression. I believe this is primarily due to the overhead of creating the sign array. Instead, here's roughly what happens in the new code:

```python
    any_neg = np.any(flat < 0)
    req_bits: int
    if any_neg:
        # Widest magnitude on either side, plus one bit for the sign.
        req_bits = max(flat.max().bit_length(), (-flat.min()).bit_length()) + 1
    else:
        req_bits = flat.max().bit_length()
```
Then the code computes up front how many `uint64` limbs to pull off, rather than looping until the value reaches zero (which never happens in the negative case; it just stays at -1).
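
As a minimal sketch, assuming the `req_bits` computed above, the limb count is just a ceiling division:

```python
# ceil(req_bits / 64): a fixed limb count is required because a negative
# Python int never shifts down to zero ((-1) >> 64 is still -1).
n_limbs = max(1, (req_bits + 63) // 64)
```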

The code fans out to ~~two~~ three separate Chapel functions depending on the case. If the input is just an `int64` or `uint64` array, it converts it directly to bigint (and similarly for `float64` or other floating-point input). If the input is numpy's version of a bigint array, it goes to the multi-limb version (unfortunately, this is the case even if everything comes out to just one limb, but I think the performance loss here is not that bad).

If the input data had any negative values, the limbs are treated as signed (all bits are positive and the top bit is negative). However, it's hard to create a bigint like this (AFAIK), so Chapel-side, the code creates a signs array, strips off the top bit of every limb, and treats it as a bool to reference later.

Either way, it goes to the "Horner fold" step, which, as ChatGPT tells me, is possibly a faster way to create the bigints. Previously the code was bit-shifting the limbs into the right spot and then adding them to the bigint value. The idea here is that you start with the highest limb of the data, then you bitshift it and add in the next limb, bitshift what you have and add in the next limb, and so on. You can read more about the generic version of this [here](https://en.wikipedia.org/wiki/Horner%27s_method) (take x = `2**64`). Then it adds in the sign bit as necessary.
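
In Python terms, the fold looks roughly like this (a minimal sketch, not the Chapel implementation; the real code keeps the per-limb sign bookkeeping described above, while this version just applies one two's-complement adjustment at the end):

```python
def horner_fold(limbs: list[int], negative: bool) -> int:
    # Horner's rule with x = 2**64: ((l0 << 64 | l1) << 64 | l2) ...
    acc = 0
    for limb in limbs:  # most significant limb first
        acc = (acc << 64) | limb
    if negative:
        # Two's-complement adjustment: the top bit's weight is negative.
        acc -= 1 << (64 * len(limbs))
    return acc

assert horner_fold([0xFFFFFFFFFFFFFFFF], negative=True) == -1
```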

It also handles the case where `max_bits` is not -1.
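
If I remember the semantics right (hedging: this is my understanding, not verified against this PR), a `max_bits` of `m` makes values wrap modulo `2**m`, so the fold result just gets masked:

```python
m = 128                           # example max_bits value (hypothetical)
value = -1                        # bigint produced by the fold
wrapped = value & ((1 << m) - 1)  # wraps modulo 2**m -> 2**128 - 1
```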

Hopefully any loss in performance is made up for by a few factors:

1. Previously the Python code was stripping the limbs off by modding by
`2**64` and then integer-dividing by the same value. I think it could
speed the code up to do a bitmask by `2**64 - 1` and then a bitshift by
64 (see the sketch after this list).
2. Supposedly the Horner fold has better performance.
3. If the input is only one limb of `int64` or `uint64` data, it should
go to the single limb version and that should run quicker.
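
For what it's worth, the mod/div and mask/shift pairs in point 1 are exact equivalents for non-negative ints, so the swap is safe:

```python
x = 123456789 * 2**100 + 42           # arbitrary multi-limb value
assert x % 2**64 == x & (2**64 - 1)   # low limb:   mod -> bitmask
assert x // 2**64 == x >> 64          # high limbs: div -> bitshift
```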

~~This also handles all cases of numeric data to bigint output in a
single function, so if the performance is back up, then I can cut out
some code in the array function.~~

I went ahead and cut the old bigint code out.

Closes #5043: Investigate performance loss from negative bigint changes