# High-Performance Pandas
pg. 208

# Intro to query() and eval()

The power of the PyData stack comes from pushing basic operations down to C. Vectorizing operations is what makes these abstractions efficient and effective, but they often rely on creating temporary intermediate objects.  
Pandas includes `eval()` and `query()`, which run operations at "C speed" without those intermediate objects.

In [1]:
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)

In [1]:
%timeit x + y

1.12 ms ± 39.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Vectorized operations like this one are much more efficient than Python loops. With compound expressions, however, they lose much of that efficiency, because every intermediate step is allocated in memory.
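
To make the point about intermediate steps concrete, here is a minimal sketch of how a compound NumPy expression breaks down into temporaries. The names `tmp1`, `tmp2`, and `mask_steps` are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1_000_000)
y = rng.rand(1_000_000)

# A compound expression written on one line...
mask = (x > 0.5) & (y < 0.5)

# ...is evaluated roughly like these explicit steps.
# Each step allocates a full-size temporary array.
tmp1 = x > 0.5
tmp2 = y < 0.5
mask_steps = tmp1 & tmp2

print(np.array_equal(mask, mask_steps))  # True
print(tmp1.nbytes + tmp2.nbytes)         # bytes held by the two temporaries
```

For boolean arrays of this size the temporaries are small, but the same pattern applies to float arrays and DataFrames, where each temporary is as large as the operands.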

### pandas.eval() for efficient operations

In [2]:
import pandas as pd

In [3]:
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)

In [4]:
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
                      for i in range(4))

In [7]:
%timeit df1 + df2 + df3 + df4

50.4 ms ± 972 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [8]:
# Using pd.eval()
%timeit pd.eval('df1 + df2 + df3 + df4')

51 ms ± 895 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [9]:
# The large improvement shown in the book does not appear here. Perhaps
# some of these operations have since been optimized.
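
One possible explanation, offered as an assumption and not as a diagnosis of the timings above: `pd.eval()` uses the optional `numexpr` package only when it is installed, and otherwise falls back to its plain Python engine. The sketch below checks whether `numexpr` can be found and confirms that the default engine and `engine="python"` give the same result. The DataFrames are smaller than in the notebook to keep it quick.

```python
import importlib.util

import numpy as np
import pandas as pd

# numexpr is an optional dependency of pandas.
has_numexpr = importlib.util.find_spec("numexpr") is not None
print("numexpr available:", has_numexpr)

rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(1000, 10)) for _ in range(4))

expr = "df1 + df2 + df3 + df4"
default_result = pd.eval(expr)                   # default engine
python_result = pd.eval(expr, engine="python")   # plain Python evaluation

print(np.allclose(default_result, python_result))  # True
```

If `numexpr` is not available, both calls take the same code path, so no speedup over `df1 + df2 + df3 + df4` should be expected.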

### DataFrame.eval() for column-wise operations

DataFrames have an `eval()` method, whose advantage is that columns can be referenced by name.

In [10]:
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()

| | A | B | C |
|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 |
| 1 | 0.330858 | 0.412879 | 0.441564 |
| 2 | 0.689047 | 0.559068 | 0.23035 |
| 3 | 0.290486 | 0.695479 | 0.852587 |
| 4 | 0.42428 | 0.534344 | 0.245216 |


In [16]:
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result1

0     -1.197761
1     -1.331822
2     -1.621667
3     -6.688481
4     -1.270064
         ...   
995   -3.349773
996   -2.163240
997   -0.936554
998   -2.263292
999   -3.781258
Length: 1000, dtype: float64

In [17]:
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
result2

0     -1.197761
1     -1.331822
2     -1.621667
3     -6.688481
4     -1.270064
         ...   
995   -3.349773
996   -2.163240
997   -0.936554
998   -2.263292
999   -3.781258
Length: 1000, dtype: float64

In [18]:
result3 = df.eval('(A + B) / (C - 1)')
result3

0     -1.197761
1     -1.331822
2     -1.621667
3     -6.688481
4     -1.270064
         ...   
995   -3.349773
996   -2.163240
997   -0.936554
998   -2.263292
999   -3.781258
Length: 1000, dtype: float64
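
The three results above look identical in the printed output. A quick way to confirm this programmatically is sketched below. It rebuilds a DataFrame with the same shape and column names, so the actual values differ from the notebook's `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

result1 = (df["A"] + df["B"]) / (df["C"] - 1)     # plain pandas
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")   # top-level pd.eval()
result3 = df.eval("(A + B) / (C - 1)")            # DataFrame.eval(), columns by name

print(np.allclose(result1, result2), np.allclose(result1, result3))
```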

### Assigning columns

`eval()` can create new columns or modify existing ones.

In [19]:
df.eval('D = (A + B) / C', inplace=True)
df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 | 24.095868 |
| 1 | 0.330858 | 0.412879 | 0.441564 | 1.684325 |
| 2 | 0.689047 | 0.559068 | 0.23035 | 5.418335 |
| 3 | 0.290486 | 0.695479 | 0.852587 | 1.156439 |
| 4 | 0.42428 | 0.534344 | 0.245216 | 3.909296 |


In [20]:
df.eval('D = (A - B) / C', inplace=True)
df.head()

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.615875 | 0.525167 | 0.047354 | 1.915527 |
| 1 | 0.330858 | 0.412879 | 0.441564 | -0.185752 |
| 2 | 0.689047 | 0.559068 | 0.23035 | 0.564268 |
| 3 | 0.290486 | 0.695479 | 0.852587 | -0.475016 |
| 4 | 0.42428 | 0.534344 | 0.245216 | -0.448844 |
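
Both assignments above use `inplace=True`. As a small complementary sketch, with a toy DataFrame that is not part of the notebook, leaving `inplace` at its default returns a new DataFrame and leaves the original untouched:

```python
import pandas as pd

# Toy data chosen so the result is easy to check by hand.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [2.0, 4.0]})

# Without inplace=True, eval() returns a modified copy.
df_new = df.eval("D = (A + B) / C")

print(list(df.columns))      # ['A', 'B', 'C']
print(list(df_new.columns))  # ['A', 'B', 'C', 'D']
print(df_new["D"].tolist())  # [2.0, 1.5]
```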


### Local variables

In [21]:
column_mean = df.mean(axis=1)  # mean across the columns: one value per row
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)

True

The `@` character lets us reference local Python variables.
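
The distinction matters most when a local variable and a column share a name. In the sketch below, the toy DataFrame and the variable `A` are made up for illustration: the bare name `A` refers to the column, while `@A` refers to the Python variable.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
A = 100  # hypothetical local variable with the same name as the column

# "A" resolves to the column, "@A" to the local variable.
result = df.eval("A + @A")
print(result.tolist())  # [101, 102, 103]
```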

### DataFrame.query()

`query()` is meant for filtering operations. It also accepts `@`.

In [26]:
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result1

| | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.330858 | 0.412879 | 0.441564 | -0.185752 |
| 8 | 0.448611 | 0.415924 | 0.481001 | 0.067958 |
| 10 | 0.112910 | 0.394884 | 0.950129 | -0.296774 |
| 11 | 0.191011 | 0.118751 | 0.130223 | 0.554895 |
| 14 | 0.075723 | 0.260648 | 0.956146 | -0.193407 |
| ... | ... | ... | ... | ... |
| 964 | 0.478935 | 0.196736 | 0.913372 | 0.308964 |
| 967 | 0.498382 | 0.465993 | 0.664128 | 0.048768 |
| 980 | 0.150918 | 0.382386 | 0.305427 | -0.757852 |
| 982 | 0.207822 | 0.356162 | 0.653230 | -0.227087 |


In [27]:
result2 = df.query('A < 0.5 and B < 0.5')
result2

| | A | B | C | D |
|---|---|---|---|---|
| 1 | 0.330858 | 0.412879 | 0.441564 | -0.185752 |
| 8 | 0.448611 | 0.415924 | 0.481001 | 0.067958 |
| 10 | 0.112910 | 0.394884 | 0.950129 | -0.296774 |
| 11 | 0.191011 | 0.118751 | 0.130223 | 0.554895 |
| 14 | 0.075723 | 0.260648 | 0.956146 | -0.193407 |
| ... | ... | ... | ... | ... |
| 964 | 0.478935 | 0.196736 | 0.913372 | 0.308964 |
| 967 | 0.498382 | 0.465993 | 0.664128 | 0.048768 |
| 980 | 0.150918 | 0.382386 | 0.305427 | -0.757852 |
| 982 | 0.207822 | 0.356162 | 0.653230 | -0.227087 |


In [28]:
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
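
The cell above produces no output, so the following sketch checks that the masking expression and the `query()` string select the same rows. It regenerates a DataFrame with the same column names, so the values are not the notebook's.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

Cmean = df["C"].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]        # boolean masking
result2 = df.query("A < @Cmean and B < @Cmean")      # same filter as a query string

# Same rows selected, same values.
print(result1.index.equals(result2.index))
print(np.allclose(result1, result2))
```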

### Performance: when to use these functions

There are two considerations: memory and computation time.
If the intermediate DataFrames that get created are very large (gigabytes), it is a good idea to use eval() or query().  
In practice, computation time does not change significantly. In fact, for small DataFrames the traditional method can even be faster.
The main reason to use them is to save memory.
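
To judge whether the temporaries are large enough to matter, a rough estimate of their size helps. The sketch below is one way to do it, assuming float64 data as in the examples above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Size of the underlying float64 data: rows * columns * 8 bytes.
frame_bytes = df.values.nbytes
print(frame_bytes)  # 24000

# For the 100000 x 100 frames used earlier, each intermediate result of
# df1 + df2 + df3 + df4 would be about this large, computed without allocating it:
nrows, ncols = 100_000, 100
temp_bytes = nrows * ncols * np.dtype("float64").itemsize
print(temp_bytes / 1e6, "MB per temporary")  # 80.0 MB per temporary
```

If that estimate is small compared with the available memory, the traditional expression is likely fine.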