# 03.12 - High-Performance Pandas - eval() and query()

Since version 0.13, Pandas has included two tools, <code>eval()</code> and <code>query()</code>, that give direct access to C-speed operations without allocating intermediate arrays.

### Motivating query() and eval(): Compound Expressions

In [2]:
# example of fast vectorized operation
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y

9.85 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


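A single vectorized operation like this is fast, but compound expressions are less efficient. Every intermediate step of a compound expression is explicitly allocated in memory, which adds significant memory overhead when working with large arrays.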
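
To make the intermediate allocation concrete, here is a minimal sketch of how a compound NumPy expression breaks down into separate steps. The array size and the names `tmp1`, `tmp2`, and `mask_stepwise` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y = rng.rand(1000)

# A compound expression written on one line...
mask = (x > 0.5) & (y < 0.5)

# ...is evaluated roughly like these separate steps,
# each of which allocates a full-size temporary array.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask_stepwise = tmp1 & tmp2

# Both forms give the same boolean mask.
print(np.array_equal(mask, mask_stepwise))
```

With arrays much larger than this, those temporaries are where the extra memory goes.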

<code>eval()</code> and <code>query()</code> do not use full-size temporary arrays, which reduces memory use and can speed up the computation.

In [4]:
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [5]:
%timeit pd.eval('df1 + df2 + df3 + df4')

351 ms ± 137 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
# typical Pandas approach
%timeit df1 + df2 + df3 + df4

418 ms ± 85.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
# eval() approach
%timeit pd.eval('df1 + df2 + df3 + df4')

341 ms ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
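
Timing alone does not show that the two approaches agree, so a quick equivalence check can be useful. The sketch below uses smaller frames than the timing cells and hypothetical names (`a`, `b`, `c`, `d`) so that it runs quickly; it is an illustration, not a benchmark.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Four small DataFrames with matching shapes and default integer labels.
a, b, c, d = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(4))

direct = a + b + c + d                # ordinary Pandas arithmetic
evaluated = pd.eval('a + b + c + d')  # same expression passed as a string

same = np.allclose(direct, evaluated)
print(same)
```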


### Operations supported by <code>pd.eval()</code>

In [13]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

In [15]:
# arithmetic operations
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

In [16]:
# comparison operators
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

In [17]:
# bitwise operators
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [19]:
# boolean operators
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True

### <code>DataFrame.eval()</code> for column-wise operations

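Just as Pandas has a top-level <code>pd.eval()</code> function, DataFrames have an <code>eval()</code> method that works in a similar way. The benefit of the <code>eval()</code> method is that columns can be referred to _by name_. For example: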

In [20]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

| | A | B | C |
|---|---|---|---|
| 0 | 0.061761 | 0.925463 | 0.99742 |
| 1 | 0.209863 | 0.280456 | 0.042148 |
| 2 | 0.738991 | 0.019046 | 0.715501 |
| 3 | 0.062857 | 0.516241 | 0.604588 |
| 4 | 0.204537 | 0.813392 | 0.244804 |


Using <code>pd.eval()</code> as above, we can compute expressions with the three columns like this:

In [22]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

The <code>DataFrame.eval()</code> method allows much more succinct evaluation of expressions with the columns:

In [24]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True
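
The same comparison can be followed by hand on a tiny frame. This is a minimal sketch with made-up values; the name `small` is an assumption introduced only for illustration.

```python
import pandas as pd

# Two rows with values chosen so the result is easy to verify by hand.
small = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [3.0, 4.0]})

# pd.eval() needs the DataFrame name in front of every column...
by_attribute = pd.eval('(small.A + small.B) / (small.C - 1)')

# ...while DataFrame.eval() treats column names as variables.
by_name = small.eval('(A + B) / (C - 1)')

# Row 0: (1 + 3) / (3 - 1) = 2.0, row 1: (2 + 4) / (4 - 1) = 2.0
print(by_name.tolist())
```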

### Assignment in <code>DataFrame.eval()</code>

<code>DataFrame.eval()</code> also allows assignment to any column. Let's use our previous <code>df</code> as an example:

In [26]:
df.head()

| | A | B | C |
|---|---|---|---|
| 0 | 0.061761 | 0.925463 | 0.99742 |
| 1 | 0.209863 | 0.280456 | 0.042148 |
| 2 | 0.738991 | 0.019046 | 0.715501 |
| 3 | 0.062857 | 0.516241 | 0.604588 |
| 4 | 0.204537 | 0.813392 | 0.244804 |


We can use <code>df.eval()</code> to create a new column 'D' and assign to it a value computed from the other columns:

In [27]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.061761 | 0.925463 | 0.99742 | 0.989777 |
| 1 | 0.209863 | 0.280456 | 0.042148 | 11.633339 |
| 2 | 0.738991 | 0.019046 | 0.715501 | 1.05945 |
| 3 | 0.062857 | 0.516241 | 0.604588 | 0.95784 |
| 4 | 0.204537 | 0.813392 | 0.244804 | 4.158143 |


In the same way, we can modify an existing column:

In [28]:
df.eval('D = (A - B) / C', inplace=True)
df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.061761 | 0.925463 | 0.99742 | -0.865935 |
| 1 | 0.209863 | 0.280456 | 0.042148 | -1.674903 |
| 2 | 0.738991 | 0.019046 | 0.715501 | 1.00621 |
| 3 | 0.062857 | 0.516241 | 0.604588 | -0.749906 |
| 4 | 0.204537 | 0.813392 | 0.244804 | -2.487117 |
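
The cells above pass `inplace=True`, which mutates `df`. As a hedged sketch of the alternative, leaving `inplace` at its default makes an assignment expression return a new DataFrame and leaves the original untouched. The names `frame`, `with_total`, and `total` are assumptions for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]})

# Without inplace=True, the assignment produces a new DataFrame.
with_total = frame.eval('total = A + B')

print(list(frame.columns))       # the original keeps only A and B
print(list(with_total.columns))  # the copy gains the new column
print(with_total['total'].tolist())
```

This form is convenient when the original frame should stay unchanged, for example inside a chain of method calls.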


### Local variables in <code>DataFrame.eval()</code>

The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.

In [29]:
# mean taken across the columns (axis=1), giving one value per row
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

The <code>@</code> character marks a variable name rather than a column name, so a single expression can draw on two namespaces: the columns of the DataFrame and the Python objects in scope.

**Note**: the <code>@</code> character works **only** with <code>DataFrame.eval()</code> and not with <code>pandas.eval()</code>, which can only access Python names.
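
The difference between the two namespaces can be sketched side by side. In this illustration `frame` and `offset` are hypothetical names: the method form uses `@` for the local variable, while the function form refers to every object by its plain Python name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, 3.0]})
offset = 10.0

# DataFrame.eval(): 'A' is a column, '@offset' is a local variable.
via_method = frame.eval('A + @offset')

# pd.eval(): only Python names are visible, so the column is reached
# through the DataFrame and the variable is written without '@'.
via_function = pd.eval('frame.A + offset')

print(np.allclose(via_method, via_function))
```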

### <code>DataFrame.query()</code> method

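The <code>DataFrame</code> has another method based on evaluated strings, called the <code>query()</code> method, which is meant for filtering rows. The cells below express the same filter first with masking and <code>pd.eval()</code>, then with the more compact <code>query()</code>: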

In [30]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

In [31]:
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)

True

**Note** that the <code>query()</code> method also accepts the <code>@</code> prefix to mark local variables:

In [32]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True
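
A small frame makes it easy to see which rows a query keeps. The following sketch uses made-up values and the hypothetical names `frame` and `threshold`; it compares ordinary masking with the equivalent `query()` string.

```python
import pandas as pd

frame = pd.DataFrame({'A': [0.1, 0.9, 0.3], 'B': [0.2, 0.1, 0.8]})
threshold = 0.5

# Standard boolean masking.
masked = frame[(frame.A < threshold) & (frame.B < threshold)]

# The same filter as a query string, with '@' marking the local variable.
queried = frame.query('A < @threshold and B < @threshold')

# Only row 0 has both A and B below the threshold.
print(queried.index.tolist())
print(queried.equals(masked))
```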