## Hardware Details
[GCP](https://cloud.google.com/) VM: [n1-highmem-8](https://cloud.google.com/compute/docs/machine-types#n1_machine_types) (8 vCPUs, 52 GB memory)

In [1]:
%%bash
lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:              0
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor l

In [2]:
%%bash
head -n1 /proc/meminfo

MemTotal:       53421588 kB


## Basic Functions

In [3]:
import pandas as pd
import numpy as np
import random
import string
import gc

In [4]:
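def createTable(rowCount):
    """Build a random test table with `rowCount` rows."""
    gc.collect()  # collect unreachable objects left over from earlier runs
    return pd.DataFrame({
        # two random lowercase letters: at most 26 * 26 = 676 distinct buckets
        'bucket': [''.join(random.choices(string.ascii_lowercase, k=2)) for _ in range(rowCount)],
        'weight': np.random.uniform(0, 2, rowCount),    # floats in [0, 2)
        'qty': np.random.randint(100, size=rowCount),   # integers in [0, 100)
        'risk': np.random.randint(10, size=rowCount)})  # integers in [0, 10)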

In [5]:
def fn(t):
    # Built-in aggregations per bucket: row count, plus sum and mean of qty and risk.
    res = t.groupby('bucket').agg({'bucket': len, 'qty': [sum, np.mean], 'risk': [sum, np.mean]})
    # Flatten the two-level column index, e.g. ('qty', 'sum') -> 'qty_sum'.
    res.columns = res.columns.map('_'.join)
    # Rename to the output schema, then join the weighted averages computed per group.
    return res.rename(columns={'bucket_len':'NR', 'qty_sum':'TOTAL_QTY','qty_mean':'AVG_QTY',
                        'risk_sum':'TOTAL_RISK','risk_mean':'AVG_RISK'}).join(
        t.groupby('bucket').apply(lambda g: np.average(g.qty, weights=g.weight)).to_frame('W_AVG_QTY')).join(
        t.groupby('bucket').apply(lambda g: np.average(g.risk, weights=g.weight)).to_frame('W_AVG_RISK'))
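The non-weighted part of `fn` could also be written with pandas named aggregation, which yields the final column names directly and makes the flatten-and-rename step unnecessary. The sketch below is an illustration only and was not part of the benchmark: `basic_stats` and `demo` are hypothetical names, and `NR` is taken as the group size of the `qty` column rather than `len` of `bucket`.

```python
import pandas as pd

def basic_stats(t):
    # Each keyword is an output column: (input column, aggregation name).
    return t.groupby('bucket').agg(
        NR=('qty', 'size'),          # rows per bucket
        TOTAL_QTY=('qty', 'sum'),
        AVG_QTY=('qty', 'mean'),
        TOTAL_RISK=('risk', 'sum'),
        AVG_RISK=('risk', 'mean'),
    )

# Tiny hand-made table (hypothetical data, only to show the output shape).
demo = pd.DataFrame({'bucket': ['aa', 'aa', 'bb'],
                     'weight': [1.0, 1.0, 2.0],
                     'qty': [10, 20, 30],
                     'risk': [1, 3, 5]})
stats = basic_stats(demo)
print(stats)
```

The weighted averages would still have to be joined on afterwards, as `fn` does.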


In [6]:
def my_agg(x):
    """Compute all seven statistics for one bucket; `x` is that bucket's sub-frame."""
    data = {'NR': x.bucket.count(),
            'TOTAL_QTY': x.qty.sum(),
            'AVG_QTY': x.qty.mean(),
            'TOTAL_RISK': x.risk.sum(),
            'AVG_RISK': x.risk.mean(),
            'W_AVG_QTY':  np.average(x.qty, weights=x.weight),
            'W_AVG_RISK':  np.average(x.risk, weights=x.weight)
           }
    return pd.Series(data, index=['NR', 'TOTAL_QTY', 'AVG_QTY', 'TOTAL_RISK',
                                  'AVG_RISK', 'W_AVG_QTY', 'W_AVG_RISK'])
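Both `fn` and `my_agg` obtain the weighted averages by calling `np.average` once per bucket from Python. A weighted mean is `sum(value * weight) / sum(weight)`, so it can also be assembled from grouped sums alone. The following is a minimal sketch of that alternative, not something measured in this notebook: `weighted_means`, `small`, and the three-letter bucket alphabet are hypothetical, and the result is checked against a per-group `np.average` loop.

```python
import numpy as np
import pandas as pd

def weighted_means(t):
    # Multiply value by weight once per row, then rely on grouped sums.
    tmp = pd.DataFrame({
        'bucket': t['bucket'],
        'weight': t['weight'],
        'wq': t['qty'] * t['weight'],
        'wr': t['risk'] * t['weight'],
    })
    s = tmp.groupby('bucket').sum()
    return pd.DataFrame({
        'W_AVG_QTY': s['wq'] / s['weight'],
        'W_AVG_RISK': s['wr'] / s['weight'],
    })

# Small random table with the same columns as createTable (hypothetical data).
rng = np.random.default_rng(0)
n = 1000
small = pd.DataFrame({
    'bucket': rng.choice(list('abc'), size=n),
    'weight': rng.uniform(0, 2, n),
    'qty': rng.integers(100, size=n),
    'risk': rng.integers(10, size=n),
})

vectorized = weighted_means(small)

# Reference: np.average evaluated separately for every bucket.
reference = pd.DataFrame({
    name: {'W_AVG_QTY': np.average(g['qty'], weights=g['weight']),
           'W_AVG_RISK': np.average(g['risk'], weights=g['weight'])}
    for name, g in small.groupby('bucket')
}).T

print(np.allclose(vectorized.to_numpy(), reference.to_numpy()))
```

Whether this is faster on the table sizes below would need to be timed separately.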

## Row Count: 10k

In [7]:
t = createTable(10 * 1000)

In [8]:
%timeit fn(t)

302 ms ± 6.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})

1.05 s ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Count: 100k

In [10]:
t = createTable(100 * 1000)

In [11]:
%timeit fn(t)

345 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})

1.09 s ± 9.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Count: 1M

In [13]:
t = createTable(1000 * 1000)

In [14]:
%timeit fn(t)

1.16 s ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})

1.58 s ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Count: 10M

In [16]:
t = createTable(10 * 1000 * 1000)

In [17]:
%timeit fn(t)

13.1 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})

7.62 s ± 57.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Count: 100M

In [19]:
t = createTable(100 * 1000 * 1000)

In [20]:
%timeit fn(t)

2min 5s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [21]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})

1min 6s ± 232 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Row Count: 1000M

In [None]:
t = createTable(1000 * 1000 * 1000)
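Before running this cell it may help to estimate how much memory the table needs. The sketch below is a rough lower bound under stated assumptions: `weight` is float64, `qty` and `risk` are 8-byte integers (NumPy's default on 64-bit Linux), and every `bucket` value is a separate Python string object referenced through an 8-byte pointer. `estimate_table_bytes` is a hypothetical helper, and the temporary Python list built inside `createTable` is not counted.

```python
import sys

def estimate_table_bytes(row_count):
    # Three 8-byte numeric columns: weight (float64), qty and risk (int64 assumed).
    numeric = 3 * 8 * row_count
    # Object column: one 8-byte pointer per row ...
    pointers = 8 * row_count
    # ... plus one two-character str object per row (size depends on the Python version).
    strings = sys.getsizeof('ab') * row_count
    return numeric + pointers + strings

for rows in (100 * 1000 * 1000, 1000 * 1000 * 1000):
    print(f"{rows:>13,} rows: about {estimate_table_bytes(rows) / 1e9:.0f} GB")
```

Under these assumptions the 1000M-row estimate should come out above the 52 GB of the VM described in the hardware section, so this step may not fit in memory as written.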

In [None]:
%timeit -n 10 fn(t)

In [None]:
%timeit t.groupby('bucket').apply(my_agg).astype({'NR': 'int64', 'TOTAL_QTY': 'int64', 'TOTAL_RISK': 'int64'})