As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the `eval()` and `query()` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package. 

In [13]:
import pandas as pd
import numpy as np

## Motivating `query()` and `eval()`: Compound Expressions

In [14]:
rng = np.random.RandomState(42)
x = rng.rand(1000)
y = rng.rand(1000)

%timeit x + y

1.21 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


This is much faster than doing the addition via a Python loop or comprehension:

In [15]:
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

410 µs ± 6.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


But this abstraction can become less efficient when computing compound expressions. For example:

In [16]:
mask = (x > 0.5) & (y < 0.5)

Because NumPy evaluates each subexpression, this is roughly equivalent to the following:

In [17]:
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2

In other words, *every intermediate step is explicitly allocated in memory*. If the `x` and `y` arrays are very large, this can lead to significant memory and computational overhead. 
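To make the cost of those temporaries concrete, here is a minimal sketch that measures them. The array size of one million elements and the name `temp_bytes` are assumptions chosen for illustration; they are not part of the original example.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# Each comparison materializes a full Boolean array before the `&` runs.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask = tmp1 & tmp2

# Boolean arrays use one byte per element, so each temporary is as large as the result.
temp_bytes = tmp1.nbytes + tmp2.nbytes
print(f"temporaries: {temp_bytes / 1e6:.1f} MB, result: {mask.nbytes / 1e6:.1f} MB")
```

The two temporaries together take twice the memory of the final mask, and that ratio grows with the number of subexpressions.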

The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays. The library accepts a string giving the NumPy-style expression you'd like to compute:

In [18]:
import numexpr

mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)

True

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. 👍

## `pandas.eval()` for Efficient Operations

`eval()` uses string expressions to efficiently compute operations on `DataFrame`s.

In [22]:
import pandas as pd

n_rows, n_cols = 100000, 100
df1, df2, df3, df4 = (pd.DataFrame(rng.randint(100, size=(n_rows, n_cols))) 
                      for i in range(4))

Compute the sum of all four `DataFrame`s using the typical Pandas approach:

In [24]:
%timeit df1 + df2 + df3 + df4

89.5 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The same result can be computed via `pd.eval` by constructing the expression as a string:

In [25]:
%timeit pd.eval('df1 + df2 + df3 + df4')

46.3 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


The `eval()` version of this expression takes about **half the time** (and uses much less memory) 👍, while giving the same result:

In [26]:
np.allclose(df1 + df2 + df3 + df4,
            pd.eval('df1 + df2 + df3 + df4'))

True
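Since `pd.eval()` relies on Numexpr, it can help to choose the backend explicitly. The sketch below uses the `engine` argument of `pd.eval()`, picking `'numexpr'` when the package is importable and `'python'` otherwise. The frame names `a` and `b` and their sizes are assumptions for illustration only.

```python
import importlib.util

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = pd.DataFrame(rng.randint(100, size=(1000, 10)))
b = pd.DataFrame(rng.randint(100, size=(1000, 10)))

# Use Numexpr when it is importable; otherwise fall back to the pure-Python engine.
engine = "numexpr" if importlib.util.find_spec("numexpr") else "python"

result = pd.eval("a + b", engine=engine)

# Both engines should agree with the ordinary Pandas expression.
same = np.array_equal(result, a + b)
```

The `'python'` engine evaluates the expression much like ordinary Python would, so it is mostly useful as a fallback or for comparison, not as a speedup.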

### Operations supported by `pd.eval()`

In [27]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

#### Arithmetic operators

`pd.eval()` supports all arithmetic operators.

In [28]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

#### Comparison operators

`pd.eval()` supports all comparison operators, including chained expressions:

In [29]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

#### Bitwise operators

`pd.eval()` supports the `&` and `|` bitwise operators:

In [30]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In addition, it supports the use of the literal `and` and `or` in Boolean expressions:

In [31]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True
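The contrast with ordinary Python is worth spelling out: outside of `pd.eval()`, the `and` keyword cannot combine two `DataFrame`s. The following sketch uses small hypothetical frames `a` and `b` to show this.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = pd.DataFrame(rng.rand(5, 3))
b = pd.DataFrame(rng.rand(5, 3))

# Plain Python `and` asks for the truth value of a whole DataFrame, which is ambiguous.
try:
    (a < 0.5) and (b < 0.5)
    plain_and_failed = False
except ValueError:
    plain_and_failed = True

# Inside pd.eval(), `and` is evaluated element-wise, like `&`.
elementwise = pd.eval("(a < 0.5) and (b < 0.5)")
matches = np.array_equal(elementwise, (a < 0.5) & (b < 0.5))
```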

#### Object attributes and indices

`pd.eval()` supports access to object attributes via the `obj.attr` syntax, and indexes via the `obj[index]` syntax:

In [33]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

## `DataFrame.eval()` for Column-wise Operations

`DataFrame`s also have an `eval()` method that works in similar ways. The benefit of the `eval()` method is that columns can be referred to *by name*. 



In [35]:
df = pd.DataFrame(rng.randint(100, size=(100, 3)), columns=['A', 'B', 'C'])
df.head()

    A   B   C
0  84  53  33
1  80  48  68
2  23  40  69
3  36  93  83
4  65  42  75


Using `pd.eval()` we can compute expressions with the three columns like this:

In [36]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

The `DataFrame.eval()` method allows much more succinct evaluation of expressions with the columns:

In [37]:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

### Assignment in `DataFrame.eval()`

In [38]:
df.head()

    A   B   C
0  84  53  33
1  80  48  68
2  23  40  69
3  36  93  83
4  65  42  75


Use `df.eval()` to create a new column `'D'` and assign to it a value computed from the other columns:

In [39]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

    A   B   C         D
0  84  53  33  4.151515
1  80  48  68  1.882353
2  23  40  69  0.913043
3  36  93  83  1.554217
4  65  42  75  1.426667


In the same way, any existing column can be modified.

In [41]:
df.eval('D = (A - B) / C', inplace=True)
df.head()

    A   B   C         D
0  84  53  33  0.939394
1  80  48  68  0.470588
2  23  40  69 -0.246377
3  36  93  83 -0.686747
4  65  42  75  0.306667
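The examples above pass `inplace=True` to modify `df` directly. As a minimal sketch, assuming a recent Pandas version in which `inplace` defaults to `False`, omitting the argument leaves the original untouched and returns a new `DataFrame` instead. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [2, 2, 2]})

# Without inplace=True, eval() returns a new DataFrame that includes the assigned column.
with_d = df.eval("D = (A + B) / C")

# The original frame keeps its three columns.
original_unchanged = "D" not in df.columns
new_values = with_d["D"].tolist()
```

This form is convenient when you want to keep the source data intact, at the cost of holding a second frame in memory.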


### Local variables in `DataFrame.eval()`

The `DataFrame.eval()` method supports an additional syntax that lets it work with local Python variables.

In [42]:
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

The `@` character here marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects. 

Notice that this `@` character is only supported by the `DataFrame.eval()` *method*, not by the `pandas.eval()` *function*, because the `pandas.eval()` function only has access to the one (Python) namespace.
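The difference between the two namespaces can be sketched side by side. The variable name `offset` and the tiny frame are hypothetical. The broad `except` clause is deliberate, so the sketch does not depend on which exception type a given Pandas version raises.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})
offset = 10.0

# DataFrame.eval(): columns by bare name, Python variables with `@`.
via_method = df.eval("A + @offset")

# pd.eval(): everything is a Python name, so no prefix is used.
via_function = pd.eval("df.A + offset")

# The `@` prefix is rejected in a top-level pd.eval() call.
try:
    pd.eval("df.A + @offset")
    at_rejected = False
except Exception:
    at_rejected = True

same = np.allclose(via_method, via_function)
```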

## `DataFrame.query()` Method

In [43]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

This is an expression involving columns of the `DataFrame`, but it cannot be expressed using the `DataFrame.eval()` syntax. Instead, for this type of filtering operation, you can use the `query()` method:

In [44]:
result3 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result3)

True

In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. 

The `query()` method also accepts the `@` flag to mark local variables:

In [48]:
C_mean = df['C'].mean()
result1 = df[(df.A < C_mean) & (df.B < C_mean)]
result2 = df.query('A < @C_mean and B < @C_mean')
np.allclose(result1, result2)

True

## Performance: When to Use These Functions

Two considerations matter when deciding whether to use these functions:

- computation time

- memory use

Suggestions: 

- The difference in computation time between the traditional methods and the `eval`/`query` method is usually not significant; if anything, **the traditional method is faster for smaller arrays!** The benefit of `eval`/`query` is mainly in the saved memory, and the sometimes cleaner syntax they offer.

- If the size of the temporary `DataFrame`s is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an `eval()` or `query()` expression. 
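To judge whether temporaries are large relative to available memory, one rough approach is to inspect the size of a frame's underlying array with `df.values.nbytes`. The sketch below is only an estimate under stated assumptions: it ignores index overhead, assumes homogeneous `float64` data, and the temporary count `n_temporaries` is an illustrative figure for this one expression.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df1, df2, df3 = (pd.DataFrame(rng.rand(10_000, 10)) for _ in range(3))

# Size of one frame's data buffer in bytes (float64 is 8 bytes per value).
frame_bytes = df1.values.nbytes

# Evaluated step by step, `df1 + df2 + df3` builds one temporary frame for
# `df1 + df2` before the final result, about as large as one input.
n_temporaries = 1
estimated_temp_bytes = n_temporaries * frame_bytes
print(f"one frame: {frame_bytes / 1e6:.1f} MB, temporaries: {estimated_temp_bytes / 1e6:.1f} MB")
```

If that estimate approaches the memory you have free, an `eval()` or `query()` expression is worth trying; for frames this small, the traditional expression is likely just as good.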

