# Motivating `query()` and `eval()`: Compound Expressions

In [1]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

1.27 ms ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [2]:
%timeit np.fromiter((x1 + y1 for x1, y1 in zip(x, y)), dtype=x.dtype, count=len(x))

212 ms ± 5.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [3]:
mask = (x > 0.5) & (y < 0.5)

When NumPy evaluates a compound expression such as this one, every intermediate step is explicitly allocated in memory. If the `x` and `y` arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library lets you compute this type of compound expression element by element, without allocating full intermediate arrays.
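To make the hidden temporaries concrete, the compound mask can be unrolled by hand. This is only a rough sketch of what NumPy does internally; the names `tmp1`, `tmp2`, and `mask_stepwise` are illustrative and do not appear in the original cells.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Each comparison produces its own full-length boolean array.
tmp1 = x > 0.5
tmp2 = y < 0.5

# The final mask is a third full-length array built from the two temporaries.
mask_stepwise = tmp1 & tmp2

# The one-line compound expression gives the same result.
mask = (x > 0.5) & (y < 0.5)
print(np.array_equal(mask, mask_stepwise))

# Memory held by the two temporaries alone (one byte per boolean element).
print(tmp1.nbytes + tmp2.nbytes)
```

With arithmetic on floating-point arrays instead of boolean comparisons, each temporary would be as large as the inputs themselves, which is where the overhead becomes noticeable.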

In [5]:
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True

The benefit here is that Numexpr evaluates the expression without full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. The Pandas `eval()` and `query()` tools that we will discuss here are conceptually similar, and use the Numexpr package when it is installed.

# `pandas.eval()` for Efficient Operations

In [6]:
import pandas as pd
nrows, ncols = 100_000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))

In [7]:
%timeit df1 + df2 + df3 + df4

70.5 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [10]:
%timeit pd.eval('df1 + df2 + df3 + df4')

33.5 ms ± 653 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
np.allclose(df1 + df2 + df3 + df4, pd.eval('df1 + df2 + df3 + df4'))

True
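As a small sketch of how the evaluation backend can be selected, `pd.eval()` accepts an `engine` argument. The example below compares the default engine with the pure-Python one on two small frames; the names `a` and `b` and the frame sizes are illustrative assumptions, and no timings are implied.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a, b = (pd.DataFrame(rng.rand(1000, 5)) for _ in range(2))

# Default engine: pandas picks Numexpr if it is installed, else plain Python.
default_result = pd.eval('a + b')

# Explicitly request the pure-Python engine, which needs no extra package.
python_result = pd.eval('a + b', engine='python')

# Both engines should agree with ordinary DataFrame arithmetic.
print(np.allclose(default_result, python_result))
print(np.allclose(default_result, a + b))
```

The pure-Python engine is mainly useful for checking results; it does not avoid intermediate allocations.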

# `DataFrame.eval()` for Column-Wise Operations

## Local Variables in `DataFrame.eval()`

In [14]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|----------|----------|----------|
| 0 | 0.375506 | 0.406939 | 0.069938 |
| 1 | 0.069087 | 0.235615 | 0.154374 |
| 2 | 0.677945 | 0.433839 | 0.652324 |
| 3 | 0.264038 | 0.808055 | 0.347197 |
| 4 | 0.589161 | 0.252418 | 0.557789 |


In [16]:
column_mean = df.mean(axis=1)  # mean across the columns, one value per row
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True
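The `@` prefix matters because `DataFrame.eval()` works with two namespaces: column names and local Python variables. The following sketch, using a hypothetical frame `df_demo` and a local variable deliberately named `A`, shows how the two are kept apart.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(5, 2), columns=['A', 'B'])

# A local Python variable that shares its name with the column 'A'.
A = 10.0

# Inside the expression, a bare A refers to the column,
# while @A refers to the local variable.
shifted = df_demo.eval('A + @A')

# Equivalent computation with ordinary pandas indexing.
expected = df_demo['A'] + A
print(np.allclose(shifted, expected))
```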

# `DataFrame.query()` Method

In [17]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

In [18]:
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)

True

In [19]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True
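One way to think about `query()` is as a filter built on a boolean expression: evaluating the same string with `DataFrame.eval()` yields a row mask, and indexing with that mask selects the same rows. The sketch below uses a hypothetical frame `df_demo` to show the correspondence.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.rand(100, 3), columns=['A', 'B', 'C'])

expr = 'A < 0.5 and B < 0.5'

# eval() turns the expression into a boolean Series, one entry per row.
row_mask = df_demo.eval(expr)

# Indexing with the mask and calling query() select the same rows.
via_mask = df_demo[row_mask]
via_query = df_demo.query(expr)
print(via_query.equals(via_mask))
```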