## Hardware Details
[GCP](https://cloud.google.com/) VM: [n1-highmem-16](https://cloud.google.com/compute/docs/machine-types#n1_machine_types) (16 vCPUs, 104 GB memory)

In [1]:
%%bash
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:              0
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyperviso

In [2]:
%%bash
cat /proc/meminfo | head -n1

MemTotal:       107091244 kB


## Basic functions

In [19]:
import os

os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import modin.pandas as pd
import numpy as np
import random
import string
import gc
import functools

In [4]:
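def createTable(rowCount):
    """Build a random test table with `rowCount` rows."""
    gc.collect()  # reclaim memory from any previously deleted table first
    return pd.DataFrame({
        # Two-letter lowercase key, so at most 26 * 26 = 676 distinct groups
        'bucket': [''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(rowCount)],
        'weight': np.random.uniform(0, 2, rowCount),
        'qty': np.random.randint(100, size=rowCount, dtype='int16'),
        'risk': np.random.randint(10, size=rowCount, dtype='int16'),
    })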

In [5]:
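def executeQueryJoin(t):
    # Per-bucket row count plus sum and mean of qty and risk
    res = t.groupby('bucket').agg({'bucket': len, 'qty': [sum, np.mean], 'risk': [sum, np.mean]})
    # Flatten the two-level column index, e.g. ('qty', 'sum') -> 'qty_sum'
    res.columns = res.columns.map('_'.join)
    res = res.rename(columns={'bucket_len': 'NR', 'qty_sum': 'TOTAL_QTY', 'qty_mean': 'AVG_QTY',
                              'risk_sum': 'TOTAL_RISK', 'risk_mean': 'AVG_RISK'})
    # Weighted averages computed per group with groupby.apply, then joined on the bucket index
    w_avg_qty = t.groupby('bucket').apply(lambda g: np.average(g.qty, weights=g.weight)).transpose().rename(columns={0: 'W_AVG_QTY'})
    w_avg_risk = t.groupby('bucket').apply(lambda g: np.average(g.risk, weights=g.weight)).transpose().rename(columns={0: 'W_AVG_RISK'})
    return res.join(w_avg_qty).join(w_avg_risk)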
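
The weighted averages in `executeQueryJoin` call `np.average` once per group through `groupby.apply`. The same quantity can be written as a ratio of two grouped sums, sum(w * x) / sum(w). The sketch below illustrates that identity on a tiny hand-checkable table. It is not part of the benchmark: it uses plain pandas rather than `modin.pandas`, the helper name `weighted_avg_by_bucket` is hypothetical, and no timings were measured for it.

```python
import numpy as np
import pandas as pd  # plain pandas here (assumption), not modin.pandas


def weighted_avg_by_bucket(t, value_col, weight_col="weight", key="bucket"):
    """Per-group weighted mean as sum(w * x) / sum(w), without groupby.apply."""
    weighted_sum = (t[value_col] * t[weight_col]).groupby(t[key]).sum()
    weight_sum = t[weight_col].groupby(t[key]).sum()
    return weighted_sum / weight_sum


# Illustrative values only, not benchmark data.
demo = pd.DataFrame({
    "bucket": ["aa", "aa", "bb"],
    "weight": [1.0, 3.0, 2.0],
    "qty": [10, 20, 7],
})

w_avg_qty = weighted_avg_by_bucket(demo, "qty")
# aa: (1*10 + 3*20) / (1 + 3) = 17.5, bb: (2*7) / 2 = 7.0

# Cross-check one group against np.average, as used in executeQueryJoin.
mask = demo.bucket == "aa"
reference = np.average(demo.qty[mask], weights=demo.weight[mask])
```

Whether this form is faster under Modin on Ray would have to be measured separately; it is shown only to make the computed quantity explicit.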

## Row Number 10k

In [6]:
t = createTable(10 * 1000)



In [7]:
%timeit executeQueryJoin(t)

To request implementation, send an email to feature_requests@modin.org.


1.41 s ± 19.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Number 100k

In [8]:
del t
t = createTable(100 * 1000)

In [9]:
%timeit executeQueryJoin(t)

2.39 s ± 32.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Number 1M

In [10]:
del t
t = createTable(1000 * 1000)

In [11]:
%timeit executeQueryJoin(t)

12.7 s ± 46.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Number 10M

In [12]:
del t
t = createTable(10 * 1000 * 1000)

In [13]:
%timeit executeQueryJoin(t)

2min 10s ± 2.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
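
To put the four timings so far side by side, the sketch below copies the `%timeit` means reported above into a dictionary and derives the per-row cost and the slowdown for each tenfold increase in row count. The variable names are illustrative, and only the mean values are used (the standard deviations are ignored).

```python
# Mean wall-clock times in seconds, copied from the %timeit outputs above.
timings = {
    10_000: 1.41,
    100_000: 2.39,
    1_000_000: 12.7,
    10_000_000: 130.0,  # 2 min 10 s
}

rows = sorted(timings)
for n in rows:
    per_row_us = timings[n] / n * 1e6  # microseconds per row
    print(f"{n:>12,d} rows: {timings[n]:8.2f} s  {per_row_us:8.2f} us/row")

# Slowdown factor for each 10x step in row count.
growth = [timings[b] / timings[a] for a, b in zip(rows, rows[1:])]
print([round(g, 1) for g in growth])
```

The step ratios come out at roughly 1.7x, 5.3x, and 10.2x. One reading, offered tentatively, is that fixed overhead dominates at small sizes and the cost becomes close to linear in the row count by 10M rows.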


## Row Number 100M

In [14]:
del t
t = createTable(100 * 1000 * 1000)

In [15]:
%timeit -n 1 -r 10 executeQueryJoin(t)

[2m[36m(pid=31871)[0m Fatal Python error: Bus error
[2m[36m(pid=31871)[0m 
[2m[36m(pid=31863)[0m Fatal Python error: Bus error
[2m[36m(pid=31863)[0m 
[2m[36m(pid=31865)[0m Fatal Python error: Bus error
[2m[36m(pid=31865)[0m 
[2m[36m(pid=31870)[0m Fatal Python error: Bus error
[2m[36m(pid=31870)[0m 
[2m[36m(pid=31864)[0m Fatal Python error: Bus error
[2m[36m(pid=31864)[0m 
[2m[36m(pid=31869)[0m Fatal Python error: Bus error
[2m[36m(pid=31869)[0m 
[2m[36m(pid=31875)[0m Traceback (most recent call last):
[2m[36m(pid=31875)[0m   File "/usr/local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 98, in <module>
[2m[36m(pid=31875)[0m     ray.worker.global_worker.main_loop()
[2m[36m(pid=31875)[0m   File "/usr/local/lib/python3.6/site-packages/ray/worker.py", line 1087, in main_loop
[2m[36m(pid=31875)[0m     task = self._get_next_task_from_raylet()
[2m[36m(pid=31875)[0m   File "/usr/local/lib/python3.6/site-packages/ray/worke

ArrowIOError: Encountered unexpected EOF

## Row Number 1B

In [None]:
del t
t = createTable(1000 * 1000 * 1000)

In [None]:
%timeit -n 1 -r 10 executeQueryJoin(t)