# Performance


*Avoiding slow code*

With pandas, you'll get the most bang for your performance-buck by avoiding antipatterns. Once you've done that, there are additional options like Numba or Cython if you really need to optimize a piece of code, but those typically take more work.

This notebook will walk through several common mistakes, and show more performant ways of achieving the same thing.

In [2]:
import numpy as np
import pandas as pd

## Mistake 1: Using pandas

* At least not for things it's not meant for.
* Pandas is very fast at joins, reindex, factorization
* Not as great at, say, matrix multiplications or problems that aren't vectorizable
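
As a rough illustration of the last point, one option is to do the matrix multiplication in NumPy and keep pandas for the labeled parts of the workflow. This is only a sketch: the frames `left` and `right` and their shapes are made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical inputs: a 3x4 matrix and a 4x2 matrix stored as DataFrames.
left = pd.DataFrame(rng.random((3, 4)))
right = pd.DataFrame(rng.random((4, 2)))

# DataFrame.dot aligns left's columns with right's index before multiplying.
via_pandas = left.dot(right)

# Plain NumPy skips the label alignment and multiplies positionally.
via_numpy = left.to_numpy() @ right.to_numpy()

print(np.allclose(via_pandas.to_numpy(), via_numpy))
```

Whether the NumPy version is noticeably faster depends on the sizes involved, so it is worth timing on your own data.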

## Mistake 2: Using object dtype

Jake VanderPlas has a [great article](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) on why Python is slow for many of the things we care about as analysts / scientists. One reason is the overhead that comes from using Python objects for integers, floats, etc. relative to their native versions in languages like C.

As a small demonstration, we'll take two series, one with Python integers, and one with NumPy's int64.

In [3]:
# Two series of range(10000), different dtypes
s1 = pd.Series(range(10000), dtype=object)
s2 = pd.Series(range(10000), dtype=np.int64)

In [4]:
%timeit s1.sum()

601 µs ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [5]:
%timeit s2.sum()

44.5 µs ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


NumPy can process the specialized int64 dtype array faster than the Python object version, even though the two series hold equal values.
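
One way to see where the difference comes from, as a small sketch: the object-dtype series holds boxed Python `int` objects, while the int64 series hands back NumPy scalars read from a typed buffer. The two series mirror `s1` and `s2`, rebuilt here so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(range(10000), dtype=object)
s2 = pd.Series(range(10000), dtype=np.int64)

# Each element of the object series is a full Python object.
elem_object = type(s1.iloc[0])

# Each element of the int64 series comes back as a NumPy scalar.
elem_int64 = type(s2.iloc[0])

# The two sums agree; only the storage differs.
same_total = bool(s1.sum() == s2.sum())
print(elem_object, elem_int64, same_total)
```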

Typically you would never explicitly pass in `dtype=object` there, but occasionally object dtypes slip into pandas:

* Reading messy Excel Files

    read_excel will preserve the dtype of each cell in the spreadsheet. If you have a single column with an int, a float, and a datetime, pandas will have to store all of those as objects. This dataset probably isn't tidy though.

* Messy CSVs where pandas' usual inference fails

* "Exotic" data types like Dates, Times, Decimals

    Pandas has implemented specialized versions of datetime.datetime and datetime.timedelta, but not datetime.date, datetime.time, Decimal, etc. Depending on your application, you might be able to treat dates as datetimes at midnight.
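
For the last case, here is a minimal sketch of treating `datetime.date` values as datetimes at midnight using `pd.to_datetime`. The sample dates are invented for illustration.

```python
import datetime

import pandas as pd

# A series of datetime.date objects is stored with object dtype.
dates = pd.Series([datetime.date(2020, 1, 1), datetime.date(2020, 1, 2)])

# pd.to_datetime converts them to pandas' native datetime64 type,
# with the time component set to midnight.
converted = pd.to_datetime(dates)
print(dates.dtype, converted.dtype)
```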

As discussed in the [pandas documentation](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes), pandas uses NumPy's data types. Recent versions include more *extension types*, which are non-NumPy dtypes inside a Series or a DataFrame. Right now the most popular are:

* Datetimes with Timezones
* Categorical
* StringDtype
* Nullable Integer
* Nullable Boolean
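
A small sketch constructing one series of each of these types. The string aliases (`"category"`, `"string"`, `"Int64"`, `"boolean"`) are the ones pandas accepts for `dtype`; the sample values are made up.

```python
import pandas as pd

tz_aware = pd.Series(pd.date_range("2020-01-01", periods=2, tz="UTC"))
categorical = pd.Series(["a", "b", "a"], dtype="category")
strings = pd.Series(["a", None], dtype="string")
nullable_int = pd.Series([1, None], dtype="Int64")
nullable_bool = pd.Series([True, None], dtype="boolean")

# Print the dtype pandas reports for each series.
for s in (tz_aware, categorical, strings, nullable_int, nullable_bool):
    print(s.dtype)
```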

Previously, pandas couldn't natively store an array of integers with some missing values, since we used `np.nan` as a missing value indicator. `np.nan` is a float, and an array of some integers and some floats is just cast to an array of floats.

In [18]:
pd.Series([1, None, 2])

0    1.0
1    NaN
2    2.0
dtype: float64

You might have explicitly requested `dtype=object` to keep it from being cast to float.

In [19]:
pd.Series([1, None, 2], dtype=object)

0       1
1    None
2       2
dtype: object

But as we know, object dtype is slow. It's better to use pandas' nullable integer dtype:

In [20]:
pd.Series([1, None, 2], dtype=pd.Int64Dtype())

0       1
1    <NA>
2       2
dtype: Int64

In [21]:
a = pd.Series([None] + list(range(10000)), dtype=object)
%timeit a.mean()

595 µs ± 8.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [22]:
b = pd.Series([None] + list(range(10000)), dtype=pd.Int64Dtype())
%timeit b.mean()

131 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
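
If you already have an object-dtype series of integers and missing values, one option is to convert it with `astype`. A minimal sketch, assuming every non-missing value really is an integer:

```python
import pandas as pd

as_object = pd.Series([1, None, 2], dtype=object)

# "Int64" (capital I) is the string alias for pandas' nullable integer dtype.
as_nullable = as_object.astype("Int64")
print(as_nullable)
```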


If you have some messy data in-memory (in a Python list, say) that you wish to convert, use one of pandas' parsers like

- `pd.to_numeric`
- `pd.to_datetime`
- `pd.to_timedelta`
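
As a sketch of the first of these, `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into a missing value instead of raising. The messy input below is invented for the example.

```python
import pandas as pd

# Hypothetical messy input: numeric strings, a non-numeric string, and None.
messy = pd.Series(["1", "2", "three", None], dtype=object)

# Unparseable entries become NaN, so the result is a float64 series.
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

Coercing silently discards the unparseable values, so it is worth checking how many entries ended up missing before relying on the result.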

### Categoricals

pandas has a [Categorical Data Type](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html) that represents data coming from a specific, and typically *fixed*, set of values. Its data model includes two components:

1. `categories`: An Index storing the set of valid values.
2. `ordered`: A boolean flag indicating whether there's an ordering between the values.

If you try to set a value that's not contained in the `categories`, you'll see an exception.

In [26]:
a = pd.Categorical(['good', 'better', 'good', 'best'],
                   categories=['good', 'better', 'best'],
                   ordered=True)
a

['good', 'better', 'good', 'best']
Categories (3, object): ['good' < 'better' < 'best']

In [28]:
a[0] = 'OK'

ValueError: Cannot setitem on a Categorical with a new category, set the categories first
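
As the error message suggests, the new value has to be registered as a category first. A minimal sketch using `Categorical.add_categories`, which returns a new Categorical with the extra category appended at the end:

```python
import pandas as pd

a = pd.Categorical(['good', 'better', 'good', 'best'],
                   categories=['good', 'better', 'best'],
                   ordered=True)

# Register 'OK' as a valid category, then assign it.
b = a.add_categories(['OK'])
b[0] = 'OK'
print(b)
```

Because the categorical is ordered and new categories go last, `'OK'` ends up ranked above `'best'` here. If that ordering is not what you want, the categories would need to be reordered afterwards.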

While the primary intent of categoricals was to model data with a fixed set of allowed values, their implementation suggests another use: saving memory and improving performance.

When you make a Categorical, pandas will *factorize* the values. This creates a mapping between the original value and an integer code.

In [29]:
a

['good', 'better', 'good', 'best']
Categories (3, object): ['good' < 'better' < 'best']

In [30]:
a.categories

Index(['good', 'better', 'best'], dtype='object')

In [31]:
a.codes

array([0, 1, 0, 2], dtype=int8)

Notice two things:

1. The value `"good"` is associated with code `0`, the first element in `.categories`.
2. The codes are stored as `int8`.
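
The same codes-plus-uniques split can be reproduced with `pd.factorize`, shown here as a small sketch. Note that `pd.factorize` numbers values in order of first appearance, which happens to match the explicit category order used above.

```python
import pandas as pd

values = ['good', 'better', 'good', 'best']

# codes holds one integer per value; uniques holds each distinct value once.
codes, uniques = pd.factorize(values)

# Indexing the uniques with the codes reconstructs the original values.
roundtrip = list(uniques[codes])
print(codes, list(uniques), roundtrip)
```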

When you have many repeated values whose regular representation is larger than int8, storing the data as a Categorical can have some performance benefits.

In [47]:
import sys

population = 321_000_000
bytes_per = sys.getsizeof("AL")  # two characters per state
print("{:,d} MB".format((population * bytes_per // 1_000_000)))

16,371 MB


How many MB would you need to store the same as a Categorical?

In [58]:
character_bytes = 50 * sys.getsizeof("AL")
bytes_per_person = 1  # np.int8 code = 1 byte
print("{:,d} MB".format((character_bytes + (population * bytes_per_person)) // 1_000_000))

321 MB
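
The back-of-the-envelope numbers can be checked on a smaller scale with `Series.memory_usage(deep=True)`. This is a sketch on a made-up sample of 100,000 values drawn from three state codes, not the full population.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sample: 100,000 two-letter state codes stored as Python strings.
states = pd.Series(rng.choice(["AL", "AK", "AZ"], size=100_000), dtype=object)
as_category = states.astype("category")

# deep=True includes the memory of the Python string objects themselves.
object_bytes = states.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes:,d} bytes, category: {category_bytes:,d} bytes")
```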
