In [1]:
# dependencies
from random import randint
import pandas as pd
import seaborn as sns

In [2]:
tips = sns.load_dataset("tips")

In [90]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [35]:
tips.values.nbytes

13664

In [38]:
tips.memory_usage()

Index          128
total_bill    1952
tip           1952
sex            368
smoker         368
day            448
time           368
size          1952
dtype: int64

## Intro

There are several ways to select or operate on a subset of data based on a given condition, each with its own strengths and use cases. Notably, `pd.eval()`, `df.eval()`, and `df.query()` can handle some of these tasks while potentially offering a noticeable boost to performance. _For a more thorough walk-through of each built-in method, see [this chapter](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html) from the Python Data Science Handbook._
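As a quick orientation, here is a minimal sketch of what each of the three methods looks like in use. The small `df` frame and its values are made up for illustration; it only stands in for the `tips` data so the snippet runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the tips data (values are illustrative only).
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 3.0, 4.5, 8.0],
    "day": ["Sun", "Sat", "Sun", "Fri"],
})

# pd.eval: evaluate an expression string that refers to variables in scope.
bill_plus_tip = pd.eval("df.total_bill + df.tip")

# df.eval: column names are used directly; an assignment creates a new column.
df = df.eval("tip_pct = tip / total_bill")

# df.query: filter rows with a boolean expression string.
sunday = df.query("day == 'Sun'")

# The query result matches ordinary boolean-mask indexing.
assert sunday.equals(df[df.day == "Sun"])
```

All three take the expression as a string, which is what lets pandas hand the work to a different evaluation engine when one is available.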

#### Part 1: EDA improvements
- Using `pd.eval()` to wrap slower `.loc` calls

#### Part 2: Data Cleaning improvements
- Using `df.eval()`, `df.query()` to create and inspect columns

### Exploratory Data Analysis

In [91]:
mask = (tips.day == 'Sun')

In [92]:
%timeit tips[mask].total_bill.mean()

178 µs ± 921 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [93]:
%timeit tips.loc[mask].total_bill.mean()

178 µs ± 860 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [94]:
%timeit tips[(tips.day == 'Sun')].total_bill.mean()

236 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [95]:
%timeit tips.loc[(tips.day == 'Sun')].total_bill.mean()

233 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [96]:
%timeit pd.eval("tips[(tips.day == 'Sun')].total_bill.mean()")

6.09 ms ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [97]:
%timeit pd.eval("tips.loc[(tips.day == 'Sun')].total_bill.mean()")

6.03 ms ± 37.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
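The `%timeit` cells above only run inside IPython. A rough sketch of the same kind of comparison with the standard-library `timeit` module is below. Two things here are assumptions, not part of the notebook: the synthetic 244-row `df` frame, and the use of `df.query()` as the string-based variant instead of the `pd.eval()` call timed above. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic 244-row frame (same length as tips; contents are made up).
df = pd.DataFrame({
    "day": ["Sun", "Sat", "Thur", "Fri"] * 61,
    "total_bill": [float(i) for i in range(244)],
})


def with_mask():
    # Plain boolean-mask indexing, as in the cells above.
    return df[df.day == "Sun"].total_bill.mean()


def with_query():
    # String-based filtering through df.query().
    return df.query("day == 'Sun'").total_bill.mean()


# Both approaches must agree before their timings are worth comparing.
assert with_mask() == with_query()

n = 200  # number of calls per measurement
t_mask = timeit.timeit(with_mask, number=n) / n
t_query = timeit.timeit(with_query, number=n) / n

print(f"mask:  {t_mask * 1e6:.0f} µs per call")
print(f"query: {t_query * 1e6:.0f} µs per call")
```

Printing both results in the same unit (µs) avoids having to compare across units by eye.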


Comments:
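`pd.eval()` is slower here, and the units are being read correctly: a microsecond (µs) is one thousandth of a millisecond (ms).
The plain indexing calls take roughly 180 to 240 µs, while the `pd.eval()` calls take about 6 ms, so the common-practice methods stand for a frame this small (244 rows).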
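To make the unit comparison concrete, the arithmetic can be sketched as follows, using two of the timings reported in the `%timeit` outputs above (the variable names are just for illustration):

```python
# Timings copied from the %timeit outputs above, converted to seconds.
US = 1e-6  # one microsecond in seconds
MS = 1e-3  # one millisecond in seconds

inline_mask = 236 * US  # tips[(tips.day == 'Sun')].total_bill.mean()
eval_call = 6.09 * MS   # the same expression wrapped in pd.eval()

# Ratio of the two timings once both are in the same unit.
slowdown = eval_call / inline_mask
print(f"pd.eval() took about {slowdown:.0f}x as long")
```

With both values in seconds, the `pd.eval()` version comes out at roughly 26 times the runtime of the inline-mask version.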

Not only is `pd.eval()` beat, but it's beat by the 