The `Numexpr` library can compute compound expressions element by element, without allocating full intermediate arrays. It accepts a string giving the NumPy-style expression we'd like to compute.

The benefit here is that `Numexpr` evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. 👍
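To make "temporary arrays" concrete, here is a minimal sketch. The arrays `x` and `y` and the expression are illustrative assumptions, not part of the original notebook. It shows roughly how NumPy evaluates a compound expression step by step, and how the same expression is passed to Numexpr as a string (skipped if Numexpr is not installed).

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(1_000_000)  # illustrative data
y = rng.rand(1_000_000)

# A compound NumPy expression ...
mask = (x > 0.5) & (y < 0.5)

# ... is evaluated roughly like this: every step allocates a full-sized array
tmp1 = x > 0.5
tmp2 = y < 0.5
mask_explicit = tmp1 & tmp2

# Numexpr takes the same expression as a string and avoids the
# full-sized temporaries. It is an optional dependency, so guard the import.
try:
    import numexpr
    mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)',
                                    local_dict={'x': x, 'y': y})
except ImportError:
    mask_numexpr = None

print(np.array_equal(mask, mask_explicit))
```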

In [2]:
import numpy as np
import pandas as pd

## `pd.eval()`

`eval()`: Use string expressions to efficiently compute operations on `DataFrame`s. (*eval* stands for "evaluate")

In [18]:
rng = np.random.RandomState(42)
n_rows, n_cols = 100000, 100
df1, df2, df3, df4 = (pd.DataFrame(rng.randint(100, size=(n_rows, n_cols))) 
                      for i in range(4))

In [21]:
result1 = df1 + df2 + df3 + df4
# result1

In [20]:
%timeit df1 + df2 + df3 + df4

94.5 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [22]:
result2 = pd.eval('df1 + df2 + df3 + df4')
# result2

In [23]:
%timeit pd.eval('df1 + df2 + df3 + df4')

58.3 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
np.allclose(result1, result2)  # Check whether the two results are the same

True

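Both methods give the same result, but `pd.eval()` was faster in this run (about 58 ms versus 95 ms per loop), although its timing also varied more between runs.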
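The speed-up comes from the Numexpr engine. As a hedged sketch (the small frames `a` and `b` are made up for illustration), `pd.eval()` also takes an `engine` argument; assuming a recent pandas version, the default picks Numexpr when it is installed and otherwise falls back to plain Python evaluation, so the results should match either way.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Two small illustrative frames (not the ones timed above)
a, b = (pd.DataFrame(rng.randint(100, size=(1000, 10))) for _ in range(2))

plain = a + b
default_engine = pd.eval('a + b')                  # Numexpr if available
python_engine = pd.eval('a + b', engine='python')  # no Numexpr involved

print(np.allclose(plain, default_engine))
print(np.allclose(plain, python_engine))
```

With `engine='python'` there is no performance benefit; it is mainly useful for checking that an expression gives the same answer.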

### Operations supported by `pd.eval()`

- All **arithmetic** operators

In [25]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

In [26]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

- All **comparison** operators (including chained expressions)

In [27]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

- **Bitwise** operators

    - `&` and `|`

    - The literal `and` and `or` can also be used in Boolean expressions

In [28]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [29]:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True

- Object attributes and indices

    - Access to object attributes: `obj.attr`
    - Indexing: `obj[index]`

In [30]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

## `DataFrame.eval()`

The benefit of the `DataFrame.eval()` method is that columns can be referred to *by name*.

In [31]:
df = pd.DataFrame(rng.randint(100, size=(100, 3)), columns=['A', 'B', 'C'])
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 34 | 28 | 78 |
| 1 | 79 | 31 | 84 |
| 2 | 73 | 99 | 86 |
| 3 | 62 | 28 | 9 |
| 4 | 27 | 68 | 79 |


In [34]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)

# Use pd.eval()
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")

# Use df.eval()
result3 = df.eval('(A + B) / (C - 1)')

print(np.allclose(result1, result2))
print(np.allclose(result1, result3))

True
True


### Assignment

In [35]:
df.head()

|   | A | B | C |
|---|---|---|---|
| 0 | 34 | 28 | 78 |
| 1 | 79 | 31 | 84 |
| 2 | 73 | 99 | 86 |
| 3 | 62 | 28 | 9 |
| 4 | 27 | 68 | 79 |


In [36]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 34 | 28 | 78 | 0.794872 |
| 1 | 79 | 31 | 84 | 1.309524 |
| 2 | 73 | 99 | 86 | 2.0 |
| 3 | 62 | 28 | 9 | 10.0 |
| 4 | 27 | 68 | 79 | 1.202532 |
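A small sketch of how assignment behaves with and without `inplace=True`. The three-row frame here is a made-up example, not the `df` used above.

```python
import pandas as pd

# Small illustrative frame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [1, 2, 3]})

# Without inplace=True, eval() returns a new DataFrame and leaves df unchanged
df_new = df.eval('D = (A + B) / C')
print(list(df.columns))      # ['A', 'B', 'C']
print(list(df_new.columns))  # ['A', 'B', 'C', 'D']

# Assigning to an existing column name overwrites that column
df_new.eval('D = (A - B) / C', inplace=True)
print(df_new['D'].tolist())  # [-3.0, -1.5, -1.0]
```

Returning a copy by default keeps the original data safe; pass `inplace=True` only when you really want to modify the frame itself.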


### Work with local variables

Using the `@` character we can efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects. 

Note: the `@` character is only supported by the `DataFrame.eval()` *method*, not by the `pandas.eval()` *function*, because `pandas.eval()` only has access to one namespace (the Python one).

In [37]:
column_mean = df.mean(axis=1)  # mean across the columns, one value per row

result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True
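The two namespaces matter most when a local variable and a column share a name. In the sketch below, the frame and the local variable `A` are hypothetical and chosen only to show the collision.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})
A = 100  # hypothetical local variable with the same name as column A

# Inside df.eval(), a bare name is a column and @name is a local variable
result = df.eval('A + @A')
print(result.tolist())  # [101, 102, 103]

# pd.eval() has a single namespace, so local names are written without @
result_pd = pd.eval('df.A + A')
print(result_pd.tolist())  # [101, 102, 103]
```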

## `DataFrame.query()`

For filtering operations, we can use `df.query()`.

In [38]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]

# Use pd.eval()
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')

# Use df.query()
result3 = df.query('A < 0.5 and B < 0.5')

print(np.allclose(result1, result2))
print(np.allclose(result1, result3))

True
True


Besides being a more efficient computation, the `query()` form is **much easier to read and understand** than the masking expression.

The `query()` method also accepts the `@` flag to mark local variables.

In [39]:
C_mean = df['C'].mean()
result1 = df[(df.A < C_mean) & (df.B < C_mean)]
result2 = df.query('A < @C_mean and B < @C_mean')
np.allclose(result1, result2)

True
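Local variables in `query()` do not have to be scalars. As a sketch (the frame and the names `allowed` and `threshold` are assumptions for illustration), a local list can be combined with the `in` operator:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
allowed = [2, 4]   # hypothetical list of accepted values
threshold = 15     # hypothetical cut-off

# Masking version
result_mask = df[df['A'].isin(allowed) & (df['B'] > threshold)]

# query() version: @ marks the local variables
result_query = df.query('A in @allowed and B > @threshold')

print(result_query['A'].tolist())  # [2, 4]
```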

## When to Use These Functions

- The difference in computation time between the traditional methods and the `eval`/`query` methods is usually not significant (in fact, the traditional methods are faster for smaller arrays!)

- Benefit of `eval()` and `query()`:
    
     - Mainly in the saved memory
     - Cleaner syntax
     
- If the size of the temporary `DataFrame`s is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an `eval()` or `query()` expression.
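One rough way to judge this is to check how many bytes a single full-sized temporary would take. The sketch below uses a made-up frame of zeros; `df.values.nbytes` gives the approximate size of the underlying array.

```python
import numpy as np
import pandas as pd

# Illustrative frame: 1000 x 100 values stored as float64 (8 bytes each)
df = pd.DataFrame(np.zeros((1000, 100)))

# Approximate size of one temporary with the same shape and dtype
bytes_per_temporary = df.values.nbytes
print(bytes_per_temporary)  # 1000 * 100 * 8 = 800000
```

If this number is tiny compared to your available memory, as it is here, the traditional expression is likely fine.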