# Read the dataset

In [1]:
import gzip
import json

In [2]:
user = 'um106329'
dataset_path = '/hpcwork/' + user + '/jet_flavor_MLPhysics/dataset/'

## Some initial investigations
The problem here is that the usual way of reading JSON (parsing the whole file as one document) does not work, because the file is not a single JSON document: every line of the gzipped file holds its own JSON document. Parsing the file as a whole, or opening it without decompressing it, therefore produces rather confusing error messages.

Here are the links where I got the ideas from:  
<a href="https://stackoverflow.com/questions/56677516/how-to-open-a-json-gz-file-and-return-to-dictionary-in-python" target="_blank">https://stackoverflow.com/questions/56677516/how-to-open-a-json-gz-file-and-return-to-dictionary-in-python</a>  
<a href="https://stackoverflow.com/questions/65276808/how-to-read-json-string-from-gzip" target="_blank">https://stackoverflow.com/questions/65276808/how-to-read-json-string-from-gzip</a>

## awkward to the rescue?
<a href="https://github.com/scikit-hep/awkward-1.0/issues/437" target="_blank">https://github.com/scikit-hep/awkward-1.0/issues/437</a>

In [6]:
import time

In [7]:
import awkward1 as ak
import numpy as np

In [8]:
%time
builder = ak.ArrayBuilder()
for lineno, line in enumerate(open(dataset_path+"dataset.json.gz")):
    if lineno % 114919 == 0:
        print(time.strftime("%H:%M:%S"), ":", lineno/114919, "percent complete")
    builder.append(json.loads(line))
array = builder.snapshot()

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 5.48 µs


UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
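
As a possible alternative to the plain `open` call above, the compressed file could be read through `gzip.open` in text mode, which decompresses on the fly. The following is only a minimal sketch on a made-up toy file (the records and the file name `toy.json.gz` are assumptions for illustration, not the real dataset); it also shows where the byte `0x8b` from the error message comes from.

```python
import gzip
import json
import os
import tempfile

# Build a tiny gzipped file with one JSON document per line,
# as a stand-in for dataset.json.gz (records are made up).
records = [[47.9, 1.89, 5, [21.2, 8.37]], [35.0, 0.61, 5, [1.0]]]
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "toy.json.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading the raw bytes shows the gzip magic number 0x1f 0x8b,
# which is what a plain text-mode open() trips over.
with open(gz_path, "rb") as f:
    magic = f.read(2)

# gzip.open in text mode ("rt") yields decompressed text lines.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

With this approach the 16G decompressed copy would not be needed on disk, at the price of decompressing again on every pass over the data.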

## gunzip
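The `UnicodeDecodeError` above arises because the built-in `open` reads the compressed bytes as if they were text: `0x8b` is the second byte of the gzip magic number (`0x1f 0x8b`). Reading through `gzip.open` would avoid that; here, the dataset is instead decompressed outside of any Python code first, to make it usable by e.g. awkward1 (hopefully).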

<a href="https://www.baeldung.com/linux/gzip-and-gunzip" target="_blank">https://www.baeldung.com/linux/gzip-and-gunzip</a>  
<a href="https://unix.stackexchange.com/questions/156261/unzipping-a-gz-file-without-removing-the-gzipped-file/156324" target="_blank">https://unix.stackexchange.com/questions/156261/unzipping-a-gz-file-without-removing-the-gzipped-file/156324</a>

In [25]:
!gunzip < /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json.gz > /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json

Now it should be possible to try the previous methods again, but with decompressed data.

## Back to awkward-arrays
J. Pivarski's <a href="https://github.com/scikit-hep/awkward-1.0/issues/437#issuecomment-688389890" target="_blank">code</a>

He got the number of lines via
```
wc -l dataset.json
```
and divided it by 100 to get the step size for the progress printout.

In [12]:
# for reference, this is the line count for my copy of the file:
!wc -l /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json

11491971 /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json
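
The same bookkeeping can be done inside Python. This is a small sketch, not the code from the linked issue: `count_lines` and `progress_step` are hypothetical helper names, and the toy file only stands in for `dataset.json`.

```python
import os
import tempfile

def count_lines(path):
    # Count newline characters like `wc -l`, reading 1 MiB binary chunks
    # so that even a very large file is never held in memory at once.
    n = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            n += chunk.count(b"\n")
    return n

def progress_step(n_lines, n_reports=100):
    # Number of lines between two progress printouts (at least 1).
    return max(1, n_lines // n_reports)

# Toy file with 250 lines (made up for illustration).
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.json")
with open(toy_path, "w") as f:
    for i in range(250):
        f.write("[%d]\n" % i)

n_toy = count_lines(toy_path)
step_toy = progress_step(n_toy)
# For the line count reported by wc -l above:
step_full = progress_step(11491971)
```

For 11491971 lines this gives 114919, the step used in the progress printout of the loop.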


In [13]:
%time
builder = ak.ArrayBuilder()
for lineno, line in enumerate(open(dataset_path+"dataset.json")):
    if lineno % 114919 == 0:
        print(time.strftime("%H:%M:%S"), ":", lineno/114919, "percent complete")
    builder.append(json.loads(line))
array = builder.snapshot()

CPU times: user 2 µs, sys: 1e+03 ns, total: 3 µs
Wall time: 11.4 µs
23:27:58 : 0.0 percent complete
23:28:08 : 1.0 percent complete
23:28:17 : 2.0 percent complete
23:28:25 : 3.0 percent complete
23:28:33 : 4.0 percent complete
23:28:40 : 5.0 percent complete
23:28:49 : 6.0 percent complete
23:28:57 : 7.0 percent complete
23:29:05 : 8.0 percent complete
23:29:14 : 9.0 percent complete
23:29:22 : 10.0 percent complete
23:29:30 : 11.0 percent complete
23:29:37 : 12.0 percent complete
23:29:44 : 13.0 percent complete
23:29:53 : 14.0 percent complete
23:30:03 : 15.0 percent complete
23:30:10 : 16.0 percent complete
23:30:18 : 17.0 percent complete
23:30:25 : 18.0 percent complete
23:30:33 : 19.0 percent complete
23:30:43 : 20.0 percent complete
23:30:50 : 21.0 percent complete
23:31:00 : 22.0 percent complete
23:31:08 : 23.0 percent complete
23:31:15 : 24.0 percent complete
23:31:23 : 25.0 percent complete
23:31:31 : 26.0 percent complete
23:31:38 : 27.0 percent complete
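23:31:46 : 28.0 pe

Note: `%time` on a line of its own only times an empty statement, which is why the reported CPU and wall times are a few microseconds; `%%time` as the first line of the cell would time the whole loop. Judging from the timestamps, the loop needs roughly 8 seconds per percent, so probably somewhere around 13 to 14 minutes for the full file.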

In [14]:
array

<Array [[47.9, 1.89, 5, [21.2, ... 6, 1]]]]] type='11491971 * var * union[float6...'>

In [15]:
length_array = len(array)
length_array

11491971

That matches the line count reported by `wc -l`.

## An attempt to save the file in a better format, split into chunks
From now on, the data are saved as parquet files (via pyarrow), which is said to be faster to read than parsing a "jagged JSON".

<a href="https://stackoverflow.com/questions/66851761/best-way-to-save-a-dict-of-awkward1-arrays" target="_blank">https://stackoverflow.com/questions/66851761/best-way-to-save-a-dict-of-awkward1-arrays</a>

This required running `conda install -c conda-forge pyarrow` first.

In [16]:
ak.to_parquet(array, dataset_path+'akArray_dataset.parquet')

ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<0: double=0, 1: list<item: dense_union<0: double=0, 1: list<item: list<item: dense_union<0: double=0, 1: string=1>>>=1, 2: string=2>>=1>

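The conversion fails because the data contain some strings (such as "NaN" or "inf") as placeholders for non-existing or non-finite values, which turns the affected fields into unions of numbers and strings; see also  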
<a href="https://github.com/scikit-hep/awkward-1.0/pull/568" target="_blank">https://github.com/scikit-hep/awkward-1.0/pull/568</a>  
(redirected from <a href="https://github.com/scikit-hep/awkward-1.0/issues/437" target="_blank">https://github.com/scikit-hep/awkward-1.0/issues/437</a>)

In [18]:
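def fix(x):
    # Recursively replace the string placeholders by proper floats,
    # so that every field ends up with a purely numeric type.
    if isinstance(x, list):
        return [fix(y) for y in x]
    elif x == "NaN" or x == "-NaN":
        return np.nan
    elif x == "inf":
        return np.inf
    elif x == "-inf":
        return -np.inf
    elif isinstance(x, str):
        # Fail loudly on any placeholder that is not covered above.
        raise ValueError("unhandled string: " + repr(x))
    else:
        return x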
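
To see the effect on a single record, here is a stand-alone sketch of the same idea. It is a variant, not the function from the linked pull request: `fix_placeholders` and `PLACEHOLDERS` are hypothetical names, the record is made up, and `math` is used instead of numpy only to keep the example free of dependencies (`np.nan` and `math.nan` are both ordinary Python floats).

```python
import math

# Lookup table for the string placeholders handled by fix().
PLACEHOLDERS = {
    "NaN": math.nan,
    "-NaN": math.nan,
    "inf": math.inf,
    "-inf": -math.inf,
}

def fix_placeholders(x):
    # Recurse into nested lists, replace known placeholder strings,
    # and refuse any other string instead of passing it through.
    if isinstance(x, list):
        return [fix_placeholders(y) for y in x]
    if isinstance(x, str):
        if x in PLACEHOLDERS:
            return PLACEHOLDERS[x]
        raise ValueError("unhandled string: " + repr(x))
    return x

# Made-up record with the same kind of nesting as one line of the dataset.
record = [47.9, "NaN", 5, [21.2, "-inf"], [["inf", 0.5]]]
fixed = fix_placeholders(record)
```

After this step the record contains only numbers and lists of numbers, which is what allows the builders to produce plain `float64` and `int64` types instead of unions.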

In [None]:
b1 = ak.ArrayBuilder()
b2 = ak.ArrayBuilder()
b3 = ak.ArrayBuilder()
b4 = ak.ArrayBuilder()
b5 = ak.ArrayBuilder()
b6 = ak.ArrayBuilder()
for lineno, line in enumerate(open(dataset_path+"dataset.json")):
    if lineno % 114919 == 0:
        print(time.strftime("%H:%M:%S"), ":", lineno/114919, "percent complete")
    if lineno == 100000:
        break
    l1, l2, l3, l4, l5, l6 = fix(json.loads(line))
    b1.append(l1)
    b2.append(l2)
    b3.append(l3)
    b4.append(l4)
    b5.append(l5)
    b6.append(l6)
    
a1 = b1.snapshot()
a2 = b2.snapshot()
a3 = b3.snapshot()
a4 = b4.snapshot()
a5 = b5.snapshot()
a6 = b6.snapshot()

In [10]:
array = ak.Array({"a1": a1, "a2": a2, "a3": a3, "a4": a4, "a5": a5, "a6": a6})

In [None]:
ak.type(array)

In [12]:
array.a1

<Array [47.9, 35, 26.6, ... 25.1, 26.5, 27.8] type='100000 * float64'>

In [13]:
array.a2

<Array [1.89, 0.61, -0.53, ... 2.28, -0.0922] type='100000 * float64'>

In [14]:
array.a3

<Array [5, 5, 5, 5, 5, 5, ... 5, 5, 5, 0, 5, 5] type='100000 * int64'>

In [15]:
array.a4

<Array [[21.2, 8.37, 29, ... 0.184, 0.144]] type='100000 * var * float64'>

In [16]:
array.a5

<Array [[32.9, 3, 7, ... 0.264, 0.846, 0.489]] type='100000 * var * float64'>

In [17]:
array.a6

<Array [[[[0.0312, 0.082, ... 2, 0.489]]]] type='100000 * var * var * var * float64'>

In [18]:
ak.to_parquet(array, dataset_path+'akArray_0_99999'+'.parquet')

In [19]:
loaded_array = ak.from_parquet(dataset_path+'akArray_0_99999'+'.parquet')

In [33]:
!ls /hpcwork/um106329/jet_flavor_MLPhysics/dataset

akArray_0_99999.parquet  akArray_dataset.parquet  dataset.json
akArray_0.parquet	 akArrays		  dataset.json.gz


In [21]:
!mkdir /hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays

In [42]:
def fill_arrays(a,b):
    # Convert lines a..b-1 of dataset.json into one parquet file.
    # Every call scans the file from the first line to reach line a.
    b1 = ak.ArrayBuilder()
    b2 = ak.ArrayBuilder()
    b3 = ak.ArrayBuilder()
    b4 = ak.ArrayBuilder()
    b5 = ak.ArrayBuilder()
    b6 = ak.ArrayBuilder()
    with open(dataset_path+"dataset.json") as f:
        for lineno, line in enumerate(f):
            if lineno < a:
                continue
            if lineno >= b:
                break
            l1, l2, l3, l4, l5, l6 = fix(json.loads(line))
            b1.append(l1)
            b2.append(l2)
            b3.append(l3)
            b4.append(l4)
            b5.append(l5)
            b6.append(l6)

    a1 = b1.snapshot()
    a2 = b2.snapshot()
    a3 = b3.snapshot()
    a4 = b4.snapshot()
    a5 = b5.snapshot()
    a6 = b6.snapshot()

    array = ak.Array({"a1": a1, "a2": a2, "a3": a3, "a4": a4, "a5": a5, "a6": a6})
    ak.to_parquet(array, dataset_path+'akArrays/split_{0}_{1}.parquet'.format(a,b-1))
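
Since `fill_arrays` starts reading `dataset.json` from the first line on every call, later chunks have to skip over millions of lines first. A possible alternative, sketched here under assumptions, is to walk through the file once and hand out consecutive chunks with `itertools.islice`. `iter_chunks` is a hypothetical helper, the toy file is made up, and the awkward1 and parquet steps are left out to keep the sketch runnable on its own; in the real workflow each chunk would still go through `fix`, the six builders and `ak.to_parquet`.

```python
import itertools
import json
import os
import tempfile

def iter_chunks(path, chunk_size):
    # Yield (start_line, records) pairs. The file is opened once and each
    # islice call continues where the previous one stopped.
    with open(path) as f:
        start = 0
        while True:
            records = [json.loads(line)
                       for line in itertools.islice(f, chunk_size)]
            if not records:
                break
            yield start, records
            start += len(records)

# Toy file with 7 lines (made up), split into chunks of 3 lines.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy.json")
with open(toy_path, "w") as f:
    for i in range(7):
        f.write(json.dumps([i, [i, i + 1]]) + "\n")

chunks = list(iter_chunks(toy_path, 3))

# File names following the split_{first}_{last} convention used above.
file_names = ["split_{0}_{1}.parquet".format(s, s + len(r) - 1)
              for s, r in chunks]
```

In an actual run the chunks would be processed one at a time inside a `for` loop rather than collected with `list`, so that only one chunk is in memory at any moment.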

In [43]:
splits = []
total = 11491971
for k in range(0,total,50000):
    splits.append(k)
splits.append(total)
len(splits)

231
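
The 231 entries are chunk edges, so they describe 230 chunks, of which the last one is shorter than 50000 lines. A quick sketch to check this arithmetic (`chunk_edges` is a hypothetical helper name that simply wraps the loop above):

```python
def chunk_edges(total, chunk_size):
    # Start indices 0, chunk_size, 2*chunk_size, ... plus the final edge.
    edges = list(range(0, total, chunk_size))
    edges.append(total)
    return edges

edges = chunk_edges(11491971, 50000)

# Pair up neighbouring edges into half-open intervals [a, b).
intervals = list(zip(edges[:-1], edges[1:]))
last_size = intervals[-1][1] - intervals[-1][0]
```

Each pair `(a, b)` corresponds to one call `fill_arrays(a, b)`, which covers the lines `a` to `b-1`.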

In [32]:
fill_arrays(0,100000)

In [35]:
!rm /hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays/*

In [36]:
!ls /hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays

In [45]:
for i in range(len(splits)-1):
    print(splits[i], splits[i+1])
    fill_arrays(splits[i], splits[i+1])

0 50000
50000 100000
100000 150000
150000 200000
200000 250000
250000 300000
300000 350000
350000 400000
400000 450000
450000 500000
500000 550000
550000 600000
600000 650000
650000 700000
700000 750000
750000 800000
800000 850000
850000 900000
900000 950000
950000 1000000
1000000 1050000
1050000 1100000
1100000 1150000
1150000 1200000
1200000 1250000
1250000 1300000
1300000 1350000
1350000 1400000
1400000 1450000
1450000 1500000
1500000 1550000
1550000 1600000
1600000 1650000
1650000 1700000
1700000 1750000
1750000 1800000
1800000 1850000
1850000 1900000
1900000 1950000
1950000 2000000
2000000 2050000
2050000 2100000
2100000 2150000
2150000 2200000
2200000 2250000
2250000 2300000
2300000 2350000
2350000 2400000
2400000 2450000
2450000 2500000
2500000 2550000
2550000 2600000
2600000 2650000
2650000 2700000
2700000 2750000
2750000 2800000
2800000 2850000
2850000 2900000
2900000 2950000
2950000 3000000
3000000 3050000
3050000 3100000
3100000 3150000
3150000 3200000
3200000 3250000
325000

FYI, this took about 1 hour on login18-1.

In [46]:
!ls /hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays

split_0_49999.parquet		 split_4800000_4849999.parquet
split_10000000_10049999.parquet  split_4850000_4899999.parquet
split_1000000_1049999.parquet	 split_4900000_4949999.parquet
split_100000_149999.parquet	 split_4950000_4999999.parquet
split_10050000_10099999.parquet  split_5000000_5049999.parquet
split_10100000_10149999.parquet  split_500000_549999.parquet
split_10150000_10199999.parquet  split_50000_99999.parquet
split_10200000_10249999.parquet  split_5050000_5099999.parquet
split_10250000_10299999.parquet  split_5100000_5149999.parquet
split_10300000_10349999.parquet  split_5150000_5199999.parquet
split_10350000_10399999.parquet  split_5200000_5249999.parquet
split_10400000_10449999.parquet  split_5250000_5299999.parquet
split_10450000_10499999.parquet  split_5300000_5349999.parquet
split_10500000_10549999.parquet  split_5350000_5399999.parquet
split_1050000_1099999.parquet	 split_5400000_5449999.parquet
split_10550000_10599999.parquet  split_5450000_5499999.parquet
split_10600000_

In [47]:
!du -sh /hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays

4.2G	/hpcwork/um106329/jet_flavor_MLPhysics/dataset/akArrays


In [48]:
!du -sh /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json

16G	/hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json


In [49]:
!du -sh /hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json.gz

2.3G	/hpcwork/um106329/jet_flavor_MLPhysics/dataset/dataset.json.gz


In summary: the gzipped file uses the least space (2.3G). After decompressing, the JSON gets very large (16G) and takes a long time to read, which is bad. The arrays saved by awkward1 take 4.2G, a bit less than twice as much space as the original gzipped file, but reading them is much faster.

In [52]:
%%time
for i in range(len(splits)-1):
    array = ak.from_parquet(dataset_path+'akArrays/split_{0}_{1}.parquet'.format(splits[i],splits[i+1]-1))

CPU times: user 26.7 s, sys: 14.3 s, total: 41 s
Wall time: 49.3 s
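
The loop above rebuilds the file names from `splits`, which keeps the chunks in their original order. If the files were instead collected from a directory listing, the order would be lexicographic, as in the `ls` output further up, and would have to be fixed by sorting on the numeric start index. A minimal sketch, where `split_start` is a hypothetical helper and the names are a few entries taken from that listing:

```python
import re

def split_start(file_name):
    # Extract the first line index from a name like split_50000_99999.parquet.
    match = re.search(r"split_(\d+)_(\d+)\.parquet$", file_name)
    if match is None:
        raise ValueError("unexpected file name: " + repr(file_name))
    return int(match.group(1))

# A few names in the order in which ls prints them.
names = [
    "split_0_49999.parquet",
    "split_10000000_10049999.parquet",
    "split_1000000_1049999.parquet",
    "split_100000_149999.parquet",
    "split_50000_99999.parquet",
]

# Sort by the numeric start index instead of by string comparison.
ordered = sorted(names, key=split_start)
```

The sorted names can then be passed to `ak.from_parquet` one after the other, in the same way as in the timed loop.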
