# Pandas: eval() and query()

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query() functions, which rely on the Numexpr package.

## Compound expressions

In [6]:
import numpy as np
rng=np.random.RandomState(42)
x=rng.rand(3)
y=rng.rand(3)
print(x)
print(y)
%timeit x+y

[0.37454012 0.95071431 0.73199394]
[0.59865848 0.15601864 0.15599452]
1.18 µs ± 39.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [8]:
# vectorized addition is much faster than a Python loop or comprehension
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))

6.81 µs ± 331 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
# this abstraction can become less efficient
# when computing compound expressions
mask = (x > 0.5) & (y < 0.5)
mask

array([False,  True,  True])

In [11]:
tmp1=(x>0.5)
tmp2=(y<0.5)
mask = tmp1 & tmp2
mask

array([False,  True,  True])

In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays. The Numexpr documentation has more details, but for the time being it is sufficient to say that the library accepts a string giving the NumPy-style expression you’d like to compute:


In [12]:
import numexpr
mask_numexpr=numexpr.evaluate('(x > 0.5 ) & (y < 0.5 )')
print(mask_numexpr)
np.allclose(mask,mask_numexpr)

[False  True  True]


True

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays.
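To make the idea of avoiding full-sized temporaries concrete, here is a minimal NumPy-only sketch that evaluates the same compound expression one chunk at a time. It is a simplified illustration, not Numexpr's actual implementation; the helper name `chunked_mask` and the `chunk_size` value are assumptions made for this example.

```python
import numpy as np

def chunked_mask(x, y, chunk_size=4096):
    """Compute (x > 0.5) & (y < 0.5) for 1-D arrays, one chunk at a time."""
    out = np.empty(x.shape, dtype=bool)
    for start in range(0, x.size, chunk_size):
        stop = start + chunk_size
        # The temporaries created here hold at most chunk_size elements,
        # instead of one full-length array per intermediate step.
        out[start:stop] = (x[start:stop] > 0.5) & (y[start:stop] < 0.5)
    return out

rng = np.random.RandomState(0)
x_big = rng.rand(1_000_000)
y_big = rng.rand(1_000_000)

mask_full = (x_big > 0.5) & (y_big < 0.5)    # allocates full-size temporaries
mask_chunked = chunked_mask(x_big, y_big)    # allocates only small temporaries
print(np.array_equal(mask_full, mask_chunked))
```

Both approaches produce the same mask; the difference is only in how much scratch memory is live at any moment.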

## pandas.eval() for efficient operations

In [18]:
# pd.eval() uses string expressions to efficiently compute operations on DataFrames
import pandas as pd
nrows,ncols=100000,100
rng=np.random.RandomState(42)
df1,df2,df3,df4=(pd.DataFrame(rng.rand(nrows,ncols))
                for i in range(4))

df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.37454,0.950714,0.731994,0.598658,0.156019,0.155995,0.058084,0.866176,0.601115,0.708073,...,0.119594,0.713245,0.760785,0.561277,0.770967,0.493796,0.522733,0.427541,0.025419,0.107891
1,0.031429,0.63641,0.314356,0.508571,0.907566,0.249292,0.410383,0.755551,0.228798,0.07698,...,0.093103,0.897216,0.900418,0.633101,0.33903,0.34921,0.725956,0.89711,0.887086,0.779876
2,0.642032,0.08414,0.161629,0.898554,0.606429,0.009197,0.101472,0.663502,0.005062,0.160808,...,0.0305,0.037348,0.822601,0.360191,0.127061,0.522243,0.769994,0.215821,0.62289,0.085347
3,0.051682,0.531355,0.540635,0.63743,0.726091,0.975852,0.5163,0.322956,0.795186,0.270832,...,0.990505,0.412618,0.372018,0.776413,0.340804,0.930757,0.858413,0.428994,0.750871,0.754543
4,0.103124,0.902553,0.505252,0.826457,0.32005,0.895523,0.389202,0.010838,0.905382,0.091287,...,0.455657,0.620133,0.277381,0.188121,0.463698,0.353352,0.583656,0.077735,0.974395,0.986211


In [19]:
# compute sum using typical pandas approach
%timeit df1+df2+df3+df4

394 ms ± 7.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [20]:
# compute result via pd.eval
%timeit pd.eval('df1+df2+df3+df4')

200 ms ± 16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
# eval() takes roughly half the time here (about 200 ms vs. 394 ms) and uses less memory;
# both approaches give the same result
np.allclose(df1+df2+df3+df4,pd.eval('df1+df2+df3+df4'))

True
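The `%timeit` magic is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the loop below uses `time.perf_counter`; the frame sizes, the names `f1` to `f4`, and the repeat count are arbitrary choices for this example. Timings depend heavily on the machine and on whether Numexpr is installed, so no expected numbers are shown.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Smaller frames than in the cell above, so the sketch runs quickly
f1, f2, f3, f4 = (pd.DataFrame(rng.rand(10_000, 20)) for _ in range(4))

repeats = 20

start = time.perf_counter()
for _ in range(repeats):
    plain = f1 + f2 + f3 + f4
t_plain = (time.perf_counter() - start) / repeats

start = time.perf_counter()
for _ in range(repeats):
    evaluated = pd.eval('f1 + f2 + f3 + f4')
t_eval = (time.perf_counter() - start) / repeats

print(f"plain:   {t_plain * 1e3:.3f} ms per loop")
print(f"pd.eval: {t_eval * 1e3:.3f} ms per loop")
print(np.allclose(plain, evaluated))
```

On frames this small, the plain expression may well come out ahead, since parsing the string adds a fixed overhead.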

In [24]:
# operations supported by pd.eval()
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                           for i in range(5))
df1.head()

Unnamed: 0,0,1,2
0,710,676,749
1,813,241,915
2,421,476,480
3,850,463,934
4,347,522,819


In [25]:
# arithmetic operators
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)


True

In [26]:
# comparison and bitwise operators, object attributes, and indices are also supported
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)


True
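The comparison and bitwise operators mentioned above are not demonstrated in the cells, so here is a short sketch of them. The frame names `a` to `d` are placeholders for this example. With its default parser, `pd.eval()` also accepts the literal `and` and `or` in boolean expressions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
a, b, c, d = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for _ in range(4))

# Reference result computed with ordinary pandas operators
expected = (a < b) & (b <= c) & (c != d)

# Same expression as a string, using bitwise operators
with_symbols = pd.eval('(a < b) & (b <= c) & (c != d)')

# Same expression again, using the literal "and"
with_words = pd.eval('(a < b) and (b <= c) and (c != d)')

print(expected.equals(with_symbols))
print(expected.equals(with_words))
```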

In [28]:
# DataFrame.eval() for column-wise operation
# The benefit of the eval() method is that columns can be referred to by name. 
df=pd.DataFrame(rng.rand(1000,3),columns=['A','B','C'])
df.head()

Unnamed: 0,A,B,C
0,0.401791,0.973228,0.005811
1,0.453365,0.715901,0.635402
2,0.171049,0.17561,0.0045
3,0.25466,0.513748,0.754389
4,0.897135,0.64913,0.368049


In [30]:
# pd.eval() can compute an expression involving the three columns
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)



True

In [31]:
# the DataFrame.eval() method allows a much more succinct expression
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)


True

In [32]:
'''
we can use df.eval() to create a new column 'D' and assign to it
a value computed from other columns
'''
df.eval('D=(A+B)/C',inplace=True)
df.head()

Unnamed: 0,A,B,C,D
0,0.401791,0.973228,0.005811,236.627079
1,0.453365,0.715901,0.635402,1.840199
2,0.171049,0.17561,0.0045,77.033882
3,0.25466,0.513748,0.754389,1.018584
4,0.897135,0.64913,0.368049,4.201252


In [33]:
# in the same way, any existing column can be modified
df.eval('D=(A-B)/C',inplace=True)
df.head()

Unnamed: 0,A,B,C,D
0,0.401791,0.973228,0.005811,-98.33866
1,0.453365,0.715901,0.635402,-0.413181
2,0.171049,0.17561,0.0045,-1.013469
3,0.25466,0.513748,0.754389,-0.343441
4,0.897135,0.64913,0.368049,0.673837
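The cells above pass `inplace=True`, which modifies `df` directly. As a small sketch of the alternative, leaving `inplace` at its default of `False` makes `DataFrame.eval()` return a new DataFrame with the assigned column and leave the original untouched. The names `frame` and `with_ratio` are placeholders for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
frame = pd.DataFrame(rng.rand(5, 3), columns=['A', 'B', 'C'])

# Without inplace=True, the assignment returns a new DataFrame
with_ratio = frame.eval('D = (A + B) / C')

print(list(frame.columns))       # original is unchanged
print(list(with_ratio.columns))  # copy carries the new column
```

This form is convenient when the original frame should be kept as-is, at the cost of holding a second copy in memory.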


In [36]:
# local variable in DataFrame.eval()
column_mean = df.mean(1)

result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)


True

## DataFrame.query() Method

In [38]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)


True

In [39]:
# for filtering operations like this one, query() is more concise
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)


True
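Like `DataFrame.eval()`, the `query()` method accepts the `@` prefix to refer to local Python variables. The sketch below filters on a threshold computed outside the query string; the names `frame` and `threshold` are placeholders for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
frame = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# A local variable that is not a column of the DataFrame
threshold = frame['C'].mean()

masked = frame[(frame.A < threshold) & (frame.B < threshold)]
queried = frame.query('A < @threshold and B < @threshold')

print(masked.equals(queried))
```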

## Performance: when to use these functions

When considering whether to use these functions, there are two considerations: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays. For example, this:

In [42]:
x=df[(df.A < 0.5) & (df.B < 0.5)]
x.head()

Unnamed: 0,A,B,C,D
2,0.171049,0.17561,0.0045,-1.013469
9,0.103617,0.31753,0.185907,-1.150645
14,0.0388,0.226129,0.642966,-0.291351
16,0.17886,0.459081,0.815575,-0.343588
22,0.246868,0.28349,0.561827,-0.065183


In [43]:
# the expression above is roughly equivalent to this
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]
x.head()

Unnamed: 0,A,B,C,D
2,0.171049,0.17561,0.0045,-1.013469
9,0.103617,0.31753,0.185907,-1.150645
14,0.0388,0.226129,0.642966,-0.291351
16,0.17886,0.459081,0.815575,-0.343588
22,0.246868,0.28349,0.561827,-0.065183


In [44]:
# approximate size of the DataFrame's data in bytes
df.values.nbytes

32000
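To put that number in context, the temporaries created by the masking expression can be measured the same way. This is a rough sketch on a frame of the same shape (1000 rows, four float64 columns); the name `frame` is a placeholder, and the byte counts cover only the underlying NumPy data, not index or object overhead.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
frame = pd.DataFrame(rng.rand(1000, 4), columns=list('ABCD'))

# 1000 rows * 4 columns * 8 bytes per float64
frame_bytes = frame.values.nbytes

# The three temporaries from the explicit version of the mask
tmp1 = frame.A < 0.5
tmp2 = frame.B < 0.5
tmp3 = tmp1 & tmp2

# Each boolean temporary takes 1 byte per row
temp_bytes = sum(t.values.nbytes for t in (tmp1, tmp2, tmp3))

print(frame_bytes)  # 32000
print(temp_bytes)   # 3000
```

For a frame this small the temporaries are negligible; the same bookkeeping on frames with many millions of rows is where the overhead starts to matter.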

On the performance side, eval() can be faster even when you are not maxing out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches. In practice, I find that the difference in computation time between the traditional methods and the eval/query method is usually not significant; if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.