In [1]:
import pandas



# Jagged arrays and tables

This notebook demonstrates how Numpy idioms have been extended for jagged (i.e. "ragged") arrays, which are nested arrays that do not have a fixed number of elements in each subarray.

`JaggedArray` and `Table` are both represented as columns in memory: like-attributes are contiguous in memory and unlike-attributes may be anywhere in memory. The data structures are designed to compose, allowing for multiple levels of jaggedness or nested tables.

In [2]:
import os
os.chdir(os.path.expanduser("~"))

import pandas
import numpy
from awkward import JaggedArray, Table

Let's start with a simple jagged array: it contains four subarrays, whose lengths are 3, 0, 2, and 5.

In [3]:
a = JaggedArray.fromiter([[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]])
a

<JaggedArray [[0.  1.1 2.2] [] [3.3 4.4] [5.5 6.6 7.7 8.8 9.9]] at 724b54a5c610>

Although it is presented as a list of lists, it consists of three Numpy arrays: `starts` (the index where each subarray starts), `stops` (the index where each subarray stops), and `content` (the data themselves, which may be of any type).

In [4]:
print(a.starts)
print(a.stops)
print(a.content)

[0 3 3 5]
[ 3  3  5 10]
[0.  1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9]
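To make the layout concrete, here is a minimal sketch that rebuilds the list-of-lists view from the three arrays printed above. It uses plain Numpy only; the variable names are illustrative and it is not how `JaggedArray` is implemented internally.

```python
import numpy as np

# Plain-Numpy stand-ins for the three arrays printed above.
starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# Subarray i is the slice content[starts[i]:stops[i]];
# an empty subarray simply has starts[i] == stops[i].
subarrays = [content[lo:hi].tolist() for lo, hi in zip(starts, stops)]

# The length of every subarray falls out of one vectorized subtraction.
counts = stops - starts

print(subarrays)
print(counts)
```

The Python loop is only there for display; the lengths of all subarrays come from a single array operation.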


With this data representation, we can take advantage of Numpy for fast calculations. Numpy ufuncs (such as this addition by a constant) are passed to the `content` array. The `starts` and `stops` are unchanged and therefore shared.

In [5]:
a + 100

<JaggedArray [[100.  101.1 102.2] [] [103.3 104.4] [105.5 106.6 107.7 108.8 109.9]] at 724b54a313d0>

But since jagged arrays have more structure than Numpy arrays, we have to extend the broadcasting rules. A jagged array plus a flat array adds each jagged subarray with the corresponding scalar from the flat array.

In [6]:
a + numpy.array([100, 200, 300, 400])

<JaggedArray [[100.  101.1 102.2] [] [303.3 304.4] [405.5 406.6 407.7 408.8 409.9]] at 724b54a5c590>
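One way to picture this extended broadcasting rule is to repeat each scalar once per element of its subarray and then add to `content`. The sketch below does that in plain Numpy. It assumes the subarrays are stored contiguously and in order (as they are for `a`), and it is an illustration rather than the library's actual code path.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
flat = np.array([100, 200, 300, 400])

# One scalar per subarray is required, so the lengths have to agree.
if len(flat) != len(starts):
    raise ValueError("flat array must have one element per subarray")

# Repeat each scalar once for every element of its subarray;
# the empty subarray contributes zero copies of 200.
expanded = np.repeat(flat, stops - starts)

# The jagged structure (starts/stops) is unchanged; only content is new.
result_content = content + expanded
print(result_content)
```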

Just as with broadcasting in Numpy, the length of the jagged array (4) must match the length of the flat array. The flat array below has length 5, so the following fails.

In [7]:
a + numpy.array([100, 200, 300, 400, 500])

ValueError: cannot broadcast JaggedArray of shape (4,) with Numpy array of shape (5,)

Additionally, slices of jagged arrays should act on the subarrays as elements.

In [8]:
a[::2]

<JaggedArray [[0.  1.1 2.2] [3.3 4.4]] at 724b549bdb90>

And multidimensional indexing should work as it does in Numpy. In the case below, we select the first and third subarray (first dimension) and then the second element from each (second dimension). Unlike Numpy, some subarrays might have a second element while others don't. All of the requisite checks are implemented without Python for loops.

In [9]:
a[::2, 1]

array([1.1, 4.4])
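The loop-free check described above can be sketched with plain Numpy: slice `starts` and `stops` for the first dimension, offset by the requested position for the second, and compare against `stops` to detect subarrays that are too short. This is a sketch under the assumption of a non-negative integer index, not the library's implementation.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# First dimension: a[::2] only needs to slice starts and stops.
sel_starts = starts[::2]
sel_stops = stops[::2]

# Second dimension: position 1 inside each selected subarray.
position = 1
absolute = sel_starts + position

# Vectorized bounds check: every selected subarray must be long enough.
if not (absolute < sel_stops).all():
    raise IndexError("index 1 is out of bounds for at least one subarray")

picked = content[absolute]
print(picked)
```

Applying the same steps to all four subarrays (no `[::2]`) would raise, because the empty subarray has no second element.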

## Nested jagged arrays

If we want subarrays within subarrays, we can make one `JaggedArray` the `content` of another. All of the jagged indexing rules apply recursively.

In [10]:
b = JaggedArray.fromoffsets([0, 2, 4], a)
b

<JaggedArray [[[0.  1.1 2.2] []] [[3.3 4.4] [5.5 6.6 7.7 8.8 9.9]]] at 724b5497b050>

Since this jagged-jagged array is three-dimensional, we can pass up to three integers/slices/masks/fancy indexes in its square brackets.

Below, we select the second item (`[[3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]]`), then take all subitems, then take the second subsubitem from each (`4.4` from the first and `6.6` from the second).

In [11]:
b[1, :, 1]

array([4.4, 6.6])
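The recursion can be traced by hand with plain Numpy. The sketch below assumes that the offsets `[0, 2, 4]` are equivalent to outer `starts = [0, 2]` and `stops = [2, 4]`, which index into the inner jagged array rather than into the data directly; the variable names are illustrative.

```python
import numpy as np

# Inner jagged array (the same data as `a`).
inner_starts = np.array([0, 3, 3, 5])
inner_stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# Outer level: consecutive offsets act as starts and stops over the
# inner subarrays.
offsets = np.array([0, 2, 4])
outer_starts, outer_stops = offsets[:-1], offsets[1:]

# b[1, :, 1]: take outer item 1, all of its inner subarrays, and the
# element at position 1 of each.
lo, hi = outer_starts[1], outer_stops[1]
absolute = inner_starts[lo:hi] + 1
if not (absolute < inner_stops[lo:hi]).all():
    raise IndexError("an inner subarray has no element at position 1")

picked = content[absolute]
print(picked)
```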

Sometimes, it's easier to see the structure by converting to Python lists (for small objects).

In [12]:
a.tolist()

[[0.0, 1.1, 2.2], [], [3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]]

In [13]:
b.tolist()

[[[0.0, 1.1, 2.2], []], [[3.3, 4.4], [5.5, 6.6, 7.7, 8.8, 9.9]]]

It can also help to view them as Pandas DataFrames. In Pandas, we represent the nested structure with a MultiIndex: the first key column is the outer index (`0`, `2`, `3`, skipping the empty one) and the second key column is the inner index.

In [14]:
a.pandas()

Unnamed: 0,Unnamed: 1,0
0,0,0.0
0,1,1.1
0,2,2.2
2,0,3.3
2,1,4.4
3,0,5.5
3,1,6.6
3,2,7.7
3,3,8.8
3,4,9.9


With a jagged-jagged array, the MultiIndex gets three columns.

In [15]:
b.pandas()

Unnamed: 0,Unnamed: 1,Unnamed: 2,0
0,0,0,0.0
0,0,1,1.1
0,0,2,2.2
1,0,0,3.3
1,0,1,4.4
1,1,0,5.5
1,1,1,6.6
1,1,2,7.7
1,1,3,8.8
1,1,4,9.9


These views and some of the internal calculations are made possible by switching representations. A jagged array expressed by `starts` and `stops` can also be expressed by a `parents` array: one element per `content` element, specifying the outer array (i.e. the "parent") the content belongs to.

The representations have different strengths and weaknesses: `parents` allows reduction operations to be vectorized, but it can only express empty lists by omission. In particular, we'll never know if there are any empty subarrays after the last element!

In [16]:
a.parents

array([0, 0, 0, 2, 2, 3, 3, 3, 3, 3])

In [17]:
a.index

<JaggedArray [[0 1 2] [] [0 1] [0 1 2 3 4]] at 724b54a31050>
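Both arrays shown above can be derived from `starts` and `stops` without a Python loop. The following is a minimal Numpy sketch that assumes contiguous, ordered subarrays; it is meant to illustrate the relationship between the representations, not to reproduce the library's code.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

counts = stops - starts

# parents: repeat each subarray's number once per element it owns.
# Subarray 1 is empty, so the value 1 never appears.
parents = np.repeat(np.arange(len(starts)), counts)

# index: position of each element inside its own subarray.
index = np.arange(len(content)) - starts[parents]

print(parents)
print(index)
```

Note that `parents` alone cannot tell us the array has four subarrays if the last ones are empty, which is the weakness described above.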

In the [array types language](https://mybinder.org/v2/gh/scikit-hep/awkward-array/0.0.4?filepath=binder%2Farray-types.ipynb), the jagged and the jagged-jagged arrays differ by one `[0, inf)` level:

In [18]:
print(a.type)
print(b.type)

[0, 4) -> [0, inf) -> float64
[0, 2) -> [0, inf) -> [0, inf) -> float64


Continuing with our tour of how jagged arrays extend Numpy, consider reductions: in Numpy, they turn flat arrays into scalars. They should turn jagged arrays into flat arrays.

In [19]:
a.sum()

array([ 3.3,  0. ,  7.7, 38.5])
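A per-subarray sum is where the `parents` representation pays off. As a sketch (not necessarily what `JaggedArray.sum` does internally), `numpy.bincount` with weights adds every `content` value into the bin named by its parent.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

parents = np.repeat(np.arange(len(starts)), stops - starts)

# Each content value is added into the bin of its parent subarray.
# minlength keeps bins for empty subarrays, including trailing ones
# that parents alone cannot reveal.
sums = np.bincount(parents, weights=content, minlength=len(starts))
print(sums)
```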

For maximization and minimization, we need an identity element for empty lists. Infinity is a natural identity for minimization, and negative infinity is a natural identity for maximization.

In [20]:
a.max()

array([ 2.2, -inf,  4.4,  9.9])
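The role of the identity element shows up directly in a Numpy sketch of the reduction: start every output slot at negative infinity and fold the content in with `numpy.maximum.at`. Slots belonging to empty subarrays are never touched and keep the identity. This is an illustration under the same contiguity assumption as before, not the library's own code.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

parents = np.repeat(np.arange(len(starts)), stops - starts)

# Initialize with the identity for maximization.
maxes = np.full(len(starts), -np.inf)

# Unbuffered in-place maximum: each value competes in its parent's slot.
np.maximum.at(maxes, parents, content)
print(maxes)
```

For minimization, the same pattern works with `numpy.minimum.at` and an initial value of positive infinity.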

When computing the `argmax` (index at which the maximum appears), it's better to return a singleton or an empty list, represented as jagged arrays.

In [21]:
a.argmax()

<JaggedArray [[2] [] [1] [4]] at 724b548f5cd0>

That way, the `argmax` can be used as a fancy index for a jagged array of the same structure (just as `argmax` is typically used in Numpy).

In [22]:
a[a.argmax()]

<JaggedArray [[2.2] [] [4.4] [9.9]] at 724b5497ba90>
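To see why a jagged `argmax` composes so cleanly with indexing, the sketch below computes it in plain Numpy: find where each value equals its subarray's maximum, keep the first hit per subarray, and convert to local positions. The approach and the names (`hits`, `owners`) are assumptions chosen for illustration, not the library's algorithm.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
parents = np.repeat(np.arange(len(starts)), stops - starts)

# Per-subarray maximum, with -inf left in place for empty subarrays.
maxes = np.full(len(starts), -np.inf)
np.maximum.at(maxes, parents, content)

# Absolute positions where an element equals its subarray's maximum.
hits = np.flatnonzero(content == maxes[parents])

# Keep only the first hit per subarray, as argmax does on ties.
owners, first = np.unique(parents[hits], return_index=True)
absolute = hits[first]

# Local positions, plus how many results each subarray contributes
# (1 for non-empty subarrays, 0 for empty ones).
local = absolute - starts[owners]
argmax_counts = np.bincount(owners, minlength=len(starts))

# Using the result as a fancy index amounts to one gather on content.
selected = content[absolute]
print(local, argmax_counts, selected)
```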

If we don't care which subarrays were empty, we can `flatten` the result. Using this `JaggedArray` data structure, list-flattening is very fast because the flattened result is usually just the `content` (when `starts` and `stops` are contiguous, which they usually are).

In [23]:
a[a.argmax()].flatten()

array([2.2, 4.4, 9.9])

For completeness, note that we can use jagged masks just as we can use jagged fancy indexes. The results are what you'd expect.

In [24]:
a > 5

<JaggedArray [[False False False] [] [False False] [ True  True  True  True  True]] at 724b549293d0>

In [25]:
a[a > 5]

<JaggedArray [[] [] [] [5.5 6.6 7.7 8.8 9.9]] at 724b549292d0>

In [26]:
a[a > 5].sum()

array([ 0. ,  0. ,  0. , 38.5])
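A jagged mask can be understood as an ordinary Numpy mask on `content` plus some bookkeeping to rebuild the jagged structure. The sketch below uses `parents` for that bookkeeping; it assumes contiguous subarrays and is only one possible way to do it.

```python
import numpy as np

starts = np.array([0, 3, 3, 5])
stops = np.array([3, 3, 5, 10])
content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
parents = np.repeat(np.arange(len(starts)), stops - starts)

# The comparison itself is an ordinary ufunc on content.
mask = content > 5

# Masking content is a flat Numpy operation.
kept = content[mask]

# Count the survivors per subarray to rebuild starts and stops.
new_counts = np.bincount(parents[mask], minlength=len(starts))
new_stops = np.cumsum(new_counts)
new_starts = new_stops - new_counts

# The masked sum reuses the same parents bookkeeping.
sums = np.bincount(parents[mask], weights=kept, minlength=len(starts))
print(new_starts, new_stops, sums)
```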

## Tables

It may not be clear at first why we need a `Table` type.

In [51]:
c = Table.zip(one=[0, 1, 2, 3, 4], two=[0.0, 1.1, 2.2, 3.3, 4.4])
c

<Table 5 x 2 at 724b5493c710>

In [52]:
c.tolist()

[{'one': 0, 'two': 0.0},
 {'one': 1, 'two': 1.1},
 {'one': 2, 'two': 2.2},
 {'one': 3, 'two': 3.3},
 {'one': 4, 'two': 4.4}]

In [27]:
c.pandas()

Unnamed: 0,two,one
0,0.0,0
1,1.1,1
2,2.2,2
3,3.3,3
4,4.4,4


In [28]:
d = Table.fromrec(numpy.array([(0, 0.0), (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4)], dtype=[("one", int), ("two", float)]))
d.pandas()

Unnamed: 0,one,two
0,0,0.0
1,1,1.1
2,2,2.2
3,3,3.3
4,4,4.4


In [29]:
c["two"], d["two"]

(array([0. , 1.1, 2.2, 3.3, 4.4]), array([0. , 1.1, 2.2, 3.3, 4.4]))

In [30]:
c[2], d[2]

(<Table.Row 2>, <Table.Row 2>)

In [31]:
c[2].tolist(), d[2].tolist()

({'one': 2, 'two': 2.2}, {'one': 2, 'two': 2.2})

In [32]:
c[2].two, d[2].two

(2.2, 2.2)

In [33]:
(c + 100).pandas()

Unnamed: 0,two,one
0,100.0,100
1,101.1,101
2,102.2,102
3,103.3,103
4,104.4,104
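As stated in the introduction, a `Table` keeps like-attributes contiguous in memory. A rough mental model, sketched here with a plain dict of Numpy arrays (the `row` helper is hypothetical and not part of the library), is that selecting a column returns an existing array, a row is a view at one position across all columns, and a ufunc is applied column by column.

```python
import numpy as np

# One contiguous array per attribute.
columns = {
    "one": np.arange(5),
    "two": np.array([0.0, 1.1, 2.2, 3.3, 4.4]),
}

def row(columns, i):
    """Hypothetical helper: gather position i from every column."""
    return {name: col[i].item() for name, col in columns.items()}

# Column access hands back the stored array without copying.
two = columns["two"]

# Row access reads one position from each column.
third = row(columns, 2)

# A ufunc such as + 100 is applied to each column independently.
shifted = {name: col + 100 for name, col in columns.items()}

print(third)
print(shifted["one"], shifted["two"])
```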


In [43]:
print(c.type)

[0, 5) -> 'two' -> float64
          'one' -> int64


In [34]:
e = JaggedArray.zip(one=a, two=a*2)
e

<JaggedArray [[<Table.Row 0> <Table.Row 1> <Table.Row 2>] [] [<Table.Row 3> <Table.Row 4>] [<Table.Row 5> <Table.Row 6> <Table.Row 7> <Table.Row 8> <Table.Row 9>]] at 724b54929d10>

In [35]:
e.pandas()

Unnamed: 0,Unnamed: 1,two,one
0,0,0.0,0.0
0,1,2.2,1.1
0,2,4.4,2.2
2,0,6.6,3.3
2,1,8.8,4.4
3,0,11.0,5.5
3,1,13.2,6.6
3,2,15.4,7.7
3,3,17.6,8.8
3,4,19.8,9.9


In [39]:
e["two"].tolist()

[[0.0, 2.2, 4.4], [], [6.6, 8.8], [11.0, 13.2, 15.4, 17.6, 19.8]]

In [38]:
e[2].tolist()

[{'one': 3.3, 'two': 6.6}, {'one': 4.4, 'two': 8.8}]

In [41]:
e[2]["two"].tolist()

[6.6, 8.8]

In [48]:
e[::2, 1].tolist()

[{'one': 1.1, 'two': 2.2}, {'one': 4.4, 'two': 8.8}]

In [44]:
print(e.type)

[0, 4) -> [0, inf) -> 'two' -> float64
                      'one' -> float64


In [49]:
f = JaggedArray.fromiter([[True, False], [True, False], [True, False], [True, False]])
g = a.cross(f)
g

<JaggedArray [[<Table.Row 0> <Table.Row 1> <Table.Row 2> <Table.Row 3> <Table.Row 4> <Table.Row 5>] [] [<Table.Row 6> <Table.Row 7> <Table.Row 8> <Table.Row 9>] [<Table.Row 10> <Table.Row 11> <Table.Row 12> ... <Table.Row 17>, <Table.Row 18>, <Table.Row 19>]] at 724b5493c3d0>
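Judging by the row counts above, `cross` pairs every element of a subarray of `a` with every element of the matching subarray of `f`. The following Numpy sketch builds such per-subarray pairs without a Python loop. The pair ordering (left element varies slowest) and all variable names are assumptions made for illustration, not a description of the library's implementation.

```python
import numpy as np

# Left operand: the jagged array `a`.
a_starts = np.array([0, 3, 3, 5])
a_counts = np.array([3, 0, 2, 5])
a_content = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])

# Right operand: four subarrays of [True, False].
f_starts = np.array([0, 2, 4, 6])
f_counts = np.array([2, 2, 2, 2])
f_content = np.array([True, False] * 4)

# Each subarray of the result holds (left count) x (right count) pairs.
pair_counts = a_counts * f_counts
pair_stops = np.cumsum(pair_counts)
pair_starts = pair_stops - pair_counts

# For every pair: which subarray it belongs to and its position there.
parents = np.repeat(np.arange(len(pair_counts)), pair_counts)
k = np.arange(pair_counts.sum()) - pair_starts[parents]

# Split the position into a left index (slow) and a right index (fast).
left = a_content[a_starts[parents] + k // f_counts[parents]]
right = f_content[f_starts[parents] + k % f_counts[parents]]

print(pair_counts)
print(left[:6], right[:6])
```

The two flat arrays `left` and `right` play the role of the table's two columns, and `pair_starts` and `pair_stops` supply the jagged structure around them.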

In [50]:
g.pandas()

Unnamed: 0,Unnamed: 1,0,1
0,0,0.0,True
0,1,0.0,False
0,2,1.1,True
0,3,1.1,False
0,4,2.2,True
0,5,2.2,False
2,0,3.3,True
2,1,3.3,False
2,2,4.4,True
2,3,4.4,False
