In [1]:
import pandas as pd
import numpy as np

As we’ve already seen in previous chapters, the power of the PyData stack is built
upon the ability of NumPy and Pandas to push basic operations into C via an
intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and
grouping-type operations in Pandas. While these abstractions are efficient and
effective for many common use cases, they often rely on the creation of temporary
intermediate objects, which can cause undue overhead in computational time and
memory use.

# Motivating query() and eval(): Compound Expressions

When NumPy evaluates a compound expression, every intermediate step is explicitly
allocated in memory. If the x and y arrays involved are very large, this can lead to
significant memory and computational overhead. The Numexpr library gives you the
ability to compute this type of compound expression element by element, without the
need to allocate full intermediate arrays. The Numexpr documentation has more
details, but for the time being it is sufficient to say that the library accepts a
string giving the NumPy-style expression you’d like to compute.
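As a minimal sketch of the idea, the snippet below spells out the temporaries behind a compound mask and then, if Numexpr happens to be installed, evaluates the same expression from a string. The array sizes, the seed, and the names `tmp1`, `tmp2`, and `mask_numexpr` are illustrative assumptions, not taken from the text above.

```python
import numpy as np

# Illustrative data (sizes and seed are assumptions for this sketch)
rng = np.random.RandomState(0)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# A compound expression written the usual NumPy way
mask = (x > 0.5) & (y < 0.5)

# Roughly what happens under the hood: each step builds a full-size temporary
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_explicit = tmp1 & tmp2

# Numexpr is an optional dependency, so guard the import
try:
    import numexpr
    # The expression is passed as a string; x and y are looked up by name
    mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
except ImportError:
    mask_numexpr = None
```

All three routes should produce the same Boolean mask; only the number of full-size temporaries differs.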

# pandas.eval() for Efficient Operations

In [9]:
#The eval() function in Pandas uses string expressions to efficiently compute
#operations using DataFrames. For example, consider the following DataFrames:

nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [10]:
#To compute the sum of all four DataFrames using the typical Pandas approach, we can
#just write the sum:

%timeit df1 + df2 + df3 + df4

202 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [11]:
#We can compute the same result via pd.eval by constructing the expression as a
#string:
%timeit pd.eval('df1 + df2 + df3 + df4')

53.7 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
#In this run, the eval() version of the expression is nearly four times faster
#(53.7 ms versus 202 ms) and uses much less memory, while giving the same result:

In [13]:
np.allclose(df1 + df2 + df3 + df4,
            pd.eval('df1 + df2 + df3 + df4'))

True
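How much pd.eval() helps depends on which backend evaluates the string. As a hedged sketch, the `engine` parameter of `pd.eval` can be set to `'python'` to bypass Numexpr, which is handy for checking that both paths agree. The frames `a` and `b` and their sizes are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative frames (names and sizes are assumptions)
rng = np.random.RandomState(0)
a = pd.DataFrame(rng.rand(1000, 10))
b = pd.DataFrame(rng.rand(1000, 10))

# Default engine: Numexpr when it is installed, plain Python otherwise
default_result = pd.eval('a + b')

# Explicitly request the pure-Python engine
python_result = pd.eval('a + b', engine='python')

same = np.allclose(default_result, python_result)
```

The pure-Python engine is mainly useful for testing, so it should not be expected to show the speedup seen above.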

# Operations supported by pd.eval()

In [14]:
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))

# Arithmetic operators.

pd.eval() supports all arithmetic operators. For example:

In [15]:
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)

True

# Comparison operators.
pd.eval() supports all comparison operators, including chained expressions:

In [17]:
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)

True

# Bitwise operators. 
pd.eval() supports the & and | bitwise operators:

In [18]:
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)

True

In [19]:
#In addition, it supports the use of the literal keywords 'and' and 'or' in Boolean expressions:
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)

True

# Object attributes and indices. 

pd.eval() supports access to object attributes via the
obj.attr syntax, and indexes via the obj[index] syntax:

In [20]:
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)

True

# Other operations. 

Other operations, such as function calls, conditional statements,
loops, and other more involved constructs, are currently not implemented in
pd.eval(). If you’d like to execute these more complicated types of expressions, you
can use the Numexpr library itself.

# DataFrame.eval() for Column-Wise Operations

In [21]:
#Just as Pandas has a top-level pd.eval() function, DataFrames have an eval()
#method that works in similar ways. The benefit of the eval() method is that columns
#can be referred to by name. We’ll use this labeled array as an example:

df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

          A         B         C
0  0.375506  0.406939  0.069938
1  0.069087  0.235615  0.154374
2  0.677945  0.433839  0.652324
3  0.264038  0.808055  0.347197
4  0.589161  0.252418  0.557789


In [22]:
#Using pd.eval() as above, we can compute expressions with the three columns like
#this:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)

True

In [23]:
#The DataFrame.eval() method allows much more succinct evaluation of expressions
#with the columns:
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)

True

Notice here that we treat column names as variables within the evaluated expression,
and the result matches the one computed with explicit column indexing.
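To make the column-as-variable behavior concrete, here is a small sketch with hand-picked values so the arithmetic can be checked by eye. The frame `small` and its values are assumptions introduced for illustration.

```python
import pandas as pd

# Tiny frame with easy-to-verify numbers (illustrative values)
small = pd.DataFrame({'A': [1.0, 2.0],
                      'B': [3.0, 4.0],
                      'C': [3.0, 5.0]})

# Column names act as variables inside the expression string
via_eval = small.eval('(A + B) / (C - 1)')

# Equivalent computation with explicit column indexing
via_index = (small['A'] + small['B']) / (small['C'] - 1)

print(via_eval.tolist())  # [2.0, 1.5]
```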

# Assignment in DataFrame.eval()

In [24]:
#In addition to the options just discussed, DataFrame.eval() also allows assignment
#to any column. Let’s use the DataFrame from before, which has columns 'A', 'B', and
#'C':
df.head()

          A         B         C
0  0.375506  0.406939  0.069938
1  0.069087  0.235615  0.154374
2  0.677945  0.433839  0.652324
3  0.264038  0.808055  0.347197
4  0.589161  0.252418  0.557789


In [25]:
#We can use df.eval() to create a new column 'D' and assign to it a value computed
#from the other columns:
df.eval('D = (A + B) / C', inplace=True)
df.head()

          A         B         C         D
0  0.375506  0.406939  0.069938  11.18762
1  0.069087  0.235615  0.154374  1.973796
2  0.677945  0.433839  0.652324  1.704344
3  0.264038  0.808055  0.347197  3.087857
4  0.589161  0.252418  0.557789  1.508776


In [27]:
#In the same way, any existing column can be modified:

df.eval('D = (A - B) / C', inplace=True)
df.head()

          A         B         C          D
0  0.375506  0.406939  0.069938  -0.449425
1  0.069087  0.235615  0.154374  -1.078728
2  0.677945  0.433839  0.652324   0.374209
3  0.264038  0.808055  0.347197  -1.566886
4  0.589161  0.252418  0.557789   0.603708
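The role of `inplace=True` in the assignments above can be seen in a short sketch. It assumes a recent pandas version in which `inplace` defaults to `False`, so that an assignment expression returns a new DataFrame and leaves the original untouched. The frame `small` and its values are illustrative.

```python
import pandas as pd

# Tiny frame with easy-to-verify numbers (illustrative values)
small = pd.DataFrame({'A': [1.0, 2.0],
                      'B': [3.0, 4.0],
                      'C': [2.0, 4.0]})

# Without inplace=True, eval() returns a copy that carries the new column
with_d = small.eval('D = (A + B) / C')
had_d_before = 'D' in small.columns  # False: the original is unchanged

# With inplace=True, the column is written into the original frame
small.eval('D = (A - B) / C', inplace=True)

print(with_d['D'].tolist())  # [2.0, 1.5]
print(small['D'].tolist())   # [-1.0, -0.5]
```

Returning a copy is convenient in method chains, while `inplace=True` avoids holding two versions of a large frame.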


# Local variables in DataFrame.eval()
The DataFrame.eval() method supports an additional syntax that lets it work with
local Python variables. Consider the following:

In [28]:
# Mean across the columns of each row (axis=1), giving one value per row
column_mean = df.mean(axis=1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

The @ character here marks a variable name rather than a column name, and lets you
efficiently evaluate expressions involving the two “namespaces”: the namespace of
columns, and the namespace of Python objects. Notice that this @ character is only
supported by the DataFrame.eval() method, not by the pandas.eval() function,
because the pandas.eval() function only has access to the one (Python) namespace.
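The difference between the two namespaces can be sketched with a scalar local variable. In the method form the local is marked with `@`, while in the top-level function both the DataFrame and the local are ordinary Python names. The names `small` and `offset` are assumptions made for this example.

```python
import pandas as pd

# Illustrative frame and local variable
small = pd.DataFrame({'A': [1.0, 2.0, 3.0]})
offset = 10.0

# DataFrame.eval(): 'A' is a column, '@offset' is a local Python variable
via_method = small.eval('A + @offset')

# pd.eval(): one namespace only, so both names are plain Python names
via_function = pd.eval('small.A + offset')

print(via_method.tolist())  # [11.0, 12.0, 13.0]
```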

# DataFrame.query() Method
The DataFrame has another method based on evaluated strings, called the query()
method. Consider the following:

In [29]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)

True

As with the example used in our discussion of DataFrame.eval(), this is an
expression involving columns of the DataFrame. It cannot be expressed using the
DataFrame.eval() syntax, however! Instead, for this type of filtering operation, you
can use the query() method:

In [30]:
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)

True

In addition to being a more efficient computation, compared to the masking
expression this is much easier to read and understand. Note that the query() method
also accepts the @ flag to mark local variables:

In [31]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)

True
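A small hand-checkable sketch of the same pattern may help: with the values below, only the first row has both columns under the limit, so both the mask and the query should keep just that row. The frame `small` and the variable `limit` are assumptions introduced for illustration.

```python
import pandas as pd

# Tiny frame where the expected rows are easy to see (illustrative values)
small = pd.DataFrame({'A': [0.1, 0.7, 0.3, 0.9],
                      'B': [0.2, 0.1, 0.8, 0.4]})
limit = 0.5

# Conventional Boolean mask
masked = small[(small.A < limit) & (small.B < limit)]

# Same filter as a query string; @limit refers to the local variable
queried = small.query('A < @limit and B < @limit')

print(queried.index.tolist())  # [0]
```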
