## Hardware Details
[GCP](https://cloud.google.com/) VM: [n1-highmem-16](https://cloud.google.com/compute/docs/machine-types#n1_machine_types) (16 vCPUs, 104 GB memory)

In [1]:
%%bash
lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2300.000
BogoMIPS:            4600.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssb

In [2]:
%%bash
head -n1 /proc/meminfo

MemTotal:       107153792 kB
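
To compare the reported `MemTotal` with the nominal 104 GB of the machine type, the value can be converted to GiB. The sketch below assumes the usual `/proc/meminfo` convention that `kB` means 1024-byte units; `meminfo_kb_to_gib` is a hypothetical helper and not part of the benchmark.

```python
def meminfo_kb_to_gib(line):
    """Convert a /proc/meminfo line such as 'MemTotal: 107153792 kB' to GiB."""
    _, value, unit = line.split()
    if unit != "kB":
        raise ValueError(f"unexpected unit: {unit}")
    # Assumption: 'kB' in /proc/meminfo denotes 1024 bytes.
    return int(value) * 1024 / 2**30

mem_gib = meminfo_kb_to_gib("MemTotal:       107153792 kB")
print(round(mem_gib, 2))  # 102.19
```

The result is slightly below the nominal size, which is probably because the kernel reserves part of the physical memory before reporting `MemTotal`.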


## Basic Functions

In [3]:
import vaex
import numpy as np
import random
import string
import gc

In [4]:
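def createTable(rowCount):
    """Build a vaex DataFrame with rowCount random rows, sorted by bucket."""
    gc.collect()  # free the previously deleted table before allocating a new one
    return vaex.from_arrays(
        # two random lowercase letters: at most 26 * 26 = 676 distinct buckets
        bucket=[''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(rowCount)],
        weight=np.random.uniform(0, 2, rowCount),
        qty=np.random.randint(100, size=rowCount, dtype='int16'),
        risk=np.random.randint(10, size=rowCount, dtype='int16'),
    ).sort('bucket')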

In [5]:
def executeQuery(t):
    # Virtual columns: defined as expressions on t and evaluated during the aggregation.
    t['qty_weighted'] = t.qty * t.weight
    t['risk_weighted'] = t.risk * t.weight
    res = t.groupby(by='bucket', agg={'NR': vaex.agg.count('qty'),
                                      'TOTAL_QTY': vaex.agg.sum('qty'),
                                      'AVG_QTY': vaex.agg.mean('qty'),
                                      'TOTAL_RISK': vaex.agg.sum('risk'),
                                      'AVG_RISK': vaex.agg.mean('risk'),
                                      'weight_sum': vaex.agg.sum('weight'),
                                      'qty_weighted_sum': vaex.agg.sum('qty_weighted'),
                                      'risk_weighted_sum': vaex.agg.sum('risk_weighted')})
    # Weighted average per bucket = sum(value * weight) / sum(weight).
    res['W_AVG_QTY'] = res.qty_weighted_sum / res.weight_sum
    res['W_AVG_RISK'] = res.risk_weighted_sum / res.weight_sum
    # Remove the intermediate sums from the result.
    res.drop(['weight_sum', 'qty_weighted_sum', 'risk_weighted_sum'], inplace=True)
    return res
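
The weighted averages in `executeQuery` are computed as `sum(value * weight) / sum(weight)` per bucket. As a sanity check of that formula, here is a minimal NumPy-only sketch on a tiny hand-made sample. The sample values and the helper `weighted_avg_by_bucket` are hypothetical and only illustrate the arithmetic; they are not part of the benchmark.

```python
import numpy as np

# Tiny hand-made sample (hypothetical values, not the benchmark data).
bucket = np.array(["aa", "aa", "ab", "ab", "ab"])
weight = np.array([1.0, 3.0, 0.5, 0.5, 1.0])
qty = np.array([10, 20, 4, 8, 6], dtype="int16")

def weighted_avg_by_bucket(bucket, values, weight):
    """Return {bucket: sum(values * weight) / sum(weight)} for each bucket."""
    result = {}
    for key in np.unique(bucket):
        mask = bucket == key
        weighted_sum = (values[mask] * weight[mask]).sum()
        result[str(key)] = float(weighted_sum / weight[mask].sum())
    return result

w_avg_qty = weighted_avg_by_bucket(bucket, qty, weight)

# np.average with weights computes the same quantity for a single group.
for key, value in w_avg_qty.items():
    mask = bucket == key
    assert np.isclose(value, np.average(qty[mask], weights=weight[mask]))

print(w_avg_qty)  # {'aa': 17.5, 'ab': 6.0}
```

For bucket `aa` this is (10·1 + 20·3) / (1 + 3) = 17.5, which corresponds to the `W_AVG_QTY` column in the query result.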

## Row Number 10k

In [6]:
t = createTable(10 * 1000)

In [7]:
%timeit executeQuery(t)

31.7 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Row Number 100k

In [8]:
del t
t = createTable(100 * 1000)

In [9]:
%timeit executeQuery(t)

70.7 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Row Number 1M

In [10]:
del t
t = createTable(1000 * 1000)

In [11]:
%timeit executeQuery(t)

496 ms ± 32.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Number 10M

In [12]:
del t
t = createTable(10 * 1000 * 1000)

In [13]:
%timeit executeQuery(t)

2.26 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Number 100M

In [14]:
del t
t = createTable(100 * 1000 * 1000)

In [15]:
%timeit -n 1 -r 10 executeQuery(t)

21.5 s ± 682 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)


## Row Number 1B

In [16]:
del t
t = createTable(1000 * 1000 * 1000)

In [17]:
%timeit -n 1 -r 10 executeQuery(t)

3min 41s ± 4.67 s per loop (mean ± std. dev. of 10 runs, 1 loop each)
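
To make the runs easier to compare, the mean times reported above can be converted into throughput (rows per second). This is a small sketch that only re-uses the numbers from the `%timeit` outputs; the `throughput` helper is a hypothetical name introduced for illustration.

```python
# Mean wall-clock time per run in seconds, copied from the %timeit outputs above.
timings = {
    10_000: 31.7e-3,
    100_000: 70.7e-3,
    1_000_000: 496e-3,
    10_000_000: 2.26,
    100_000_000: 21.5,
    1_000_000_000: 3 * 60 + 41,  # 3min 41s
}

def throughput(timings):
    """Return rows processed per second for each row count."""
    return {rows: rows / seconds for rows, seconds in timings.items()}

rates = throughput(timings)
for rows, rate in rates.items():
    print(f"{rows:>13,} rows: {timings[rows]:>8.3f} s  {rate / 1e6:5.2f} M rows/s")
```

With these numbers, throughput rises from roughly 0.3 M rows/s at 10k rows to roughly 4.4 to 4.7 M rows/s from 10M rows upward, which suggests that fixed overhead dominates the small runs and that the query scales close to linearly once the table is large.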
