# 03.12 - High-Performance Pandas - eval() and query()

Since version 0.13, Pandas has offered tools, the <code>eval()</code> and <code>query()</code> functions, that give direct access to C-speed operations without allocating intermediate arrays.

### Motivating query() and eval(): Compound Expressions

In [2]:
# example of fast vectorized operation
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

9.85 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


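Simple vectorized operations like this are fast, but compound expressions are less efficient: NumPy evaluates every subexpression into its own full-size temporary array. Since every intermediate step is explicitly allocated in memory, this adds significant memory overhead when working with large arrays.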
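
To make the cost concrete, here is a minimal sketch of how a compound expression on two NumPy arrays is evaluated. The names `tmp1` and `tmp2` are illustrative only; they stand in for the temporaries that NumPy allocates internally and never exposes.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Compound expression written on one line
mask = (x > 0.5) & (y < 0.5)

# Roughly what happens underneath: each subexpression is
# materialized as its own full-size array before being combined
tmp1 = x > 0.5               # temporary boolean array, one element per entry of x
tmp2 = y < 0.5               # second temporary of the same size
mask_explicit = tmp1 & tmp2

# Both routes give the same result
same = np.array_equal(mask, mask_explicit)

# Memory held by the two temporaries (boolean arrays use one byte per element)
temp_bytes = tmp1.nbytes + tmp2.nbytes
```

With longer expressions or wider dtypes, the number and size of these temporaries grow accordingly.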

<code>eval()</code> and <code>query()</code> evaluate a compound expression without allocating full-size temporary arrays, which reduces memory use and can speed up the computation.

In [4]:
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [5]:
%timeit pd.eval('df1 + df2 + df3 + df4')

351 ms ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
# typical Pandas approach
%timeit df1 + df2 + df3 + df4

418 ms ± 85.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
# eval() approach
%timeit pd.eval('df1 + df2 + df3 + df4')

341 ms ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
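
The timings only matter if both approaches compute the same thing. A minimal sketch of that check follows, assuming smaller frames than in the timing cells (the sizes here are chosen only to keep the check quick):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Assumption: smaller than the frames timed above, so the check runs fast
nrows, ncols = 1000, 10
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for _ in range(4))

# typical Pandas approach
plain = df1 + df2 + df3 + df4

# eval() approach: the expression is passed as a string and the
# DataFrame names are looked up in the calling scope
evaluated = pd.eval('df1 + df2 + df3 + df4')

results_match = np.allclose(plain, evaluated)
```

`results_match` should be `True`, and `evaluated` should be a DataFrame with the same shape as the inputs.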


### Operations supported by <code>pd.eval()</code>

In [13]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

In [15]:
# arithmetic operations
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

In [16]:
# comparison operators
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

In [17]:
# bitwise operators
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [19]:
# boolean operators: literal `and`/`or` are also accepted (result1 comes from the previous cell)
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True
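
The `and`/`or` spelling is worth a second look, because it only works inside the expression string. The sketch below illustrates the difference. It assumes a threshold of 500 rather than 0.5, so that the comparisons are not trivially false on integer data. The variable names are illustrative.

```python
import pandas as pd
import numpy as np

rng = np.random.RandomState(0)
df1, df2, df3, df4 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                      for _ in range(4))

# In plain Python, `and` asks each DataFrame for a single truth value,
# which pandas refuses to provide
try:
    (df1 < 500) and (df2 < 500)
    plain_python_raised = False
except ValueError:
    plain_python_raised = True

# Inside pd.eval(), `and`/`or` are evaluated elementwise, like `&`/`|`
bitwise = pd.eval('(df1 < 500) & (df2 < 500) | (df3 < df4)')
boolean = pd.eval('(df1 < 500) and (df2 < 500) or (df3 < df4)')
spellings_agree = bitwise.equals(boolean)
```

Outside of `pd.eval()`, stick to `&` and `|` with parentheses around each comparison.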