# Performance


*Avoiding slow code*

With pandas, you'll get the most bang for your performance buck by avoiding antipatterns. Once you've done that, there are additional options such as Numba or Cython if you really need to optimize a piece of code, but those typically take more work.

This notebook will walk through several common mistakes and show more performant ways of achieving the same thing.

In [None]:
import numpy as np
import pandas as pd

## Mistake 1: Using pandas

pandas isn't always the right choice. If you're dealing with non-tabular data, or lots of linear algebra, you might be better off using something else like Python lists / dicts / sets, or raw NumPy arrays.

## Mistake 2: Using object dtype

Jake VanderPlas has a [great article](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/) on why Python is slow for many of the things we care about as analysts / scientists. One reason is the overhead that comes from using Python objects for integers, floats, etc. relative to their native versions in languages like C.

As a small demonstration, we'll make two series, one with Python integers, and one with NumPy's int64.

In [None]:
# Two series of range(10000), different dtypes
s1 = pd.Series(range(10000), dtype=object)
s2 = pd.Series(range(10000), dtype=np.int64)

Now let's do a simple operation on them, like taking the sum.

In [None]:
%timeit s1.sum()

In [None]:
%timeit s2.sum()

NumPy can process the specialized int64 array faster than the Python object version, even though the values are equal. Part of this comes from the different algorithms (the NumPy version would overflow with very large integers), and part comes from the Python version having to repeatedly unbox the actual integer from each Python object and re-box the result.

Typically you would never explicitly pass in dtype=object there, but occasionally object dtypes slip into pandas:
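The overflow caveat can be seen with a small sketch. The value `2**62` and the variable names are chosen only for illustration; the point is that the object-dtype sum uses arbitrary-precision Python integers, while the int64 sum is fixed-width and is not expected to raise when the result does not fit.

```python
import numpy as np
import pandas as pd

big = 2**62  # fits in int64, but big + big does not

as_object = pd.Series([big, big], dtype=object)
as_int64 = pd.Series([big, big], dtype=np.int64)

obj_total = as_object.sum()  # Python ints: arbitrary precision
int_total = as_int64.sum()   # fixed-width int64: the result cannot hold 2**63

print("object sum is exact:", obj_total == 2**63)
print("int64 sum is exact: ", int(int_total) == 2**63)
```

So the speed of int64 comes with a range limit that object dtype does not have.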

* Reading messy Excel Files / CSV files

  These file types either don't have or don't enforce types. pandas has to infer dtypes, which doesn't always go as expected.

* "Exotic" data types like Dates, Times, Decimals

  pandas has implemented specialized versions of `datetime.datetime` and `datetime.timedelta`, but not `datetime.date`, `datetime.time`, `Decimal`, etc. Depending on your application, you might be able to treat dates as datetimes at midnight.
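As a rough sketch of the "dates as datetimes at midnight" idea, the example below builds a small Series of `datetime.date` objects (made-up values) and converts it with `pd.to_datetime`. Whether midnight timestamps are an acceptable stand-in for dates depends on your application.

```python
import datetime as dt

import pandas as pd

# A Series of plain datetime.date objects is stored with object dtype.
dates = pd.Series([dt.date(2024, 1, 1), dt.date(2024, 1, 2)])
print(dates.dtype)

# Converting gives a native datetime64 dtype, with each date at midnight.
as_datetimes = pd.to_datetime(dates)
print(as_datetimes.dtype)
print(as_datetimes)
```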

As discussed in the [pandas documentation](https://pandas.pydata.org/docs/user_guide/basics.html#dtypes), pandas uses NumPy's data types. Recent versions include more *extension types*, which are non-NumPy dtypes inside a Series or a DataFrame. Right now the most popular are

* Datetimes with Timezones
* Categorical
* StringDtype
* Nullable Integer
* Nullable Boolean
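For orientation, here is one way to construct a Series of each of the extension types listed above. The values are arbitrary, and the string aliases (`"category"`, `"string"`, `"Int64"`, `"boolean"`) are just one of several ways to request these dtypes.

```python
import pandas as pd

examples = {
    "tz-aware datetime": pd.Series(pd.date_range("2024-01-01", periods=2, tz="UTC")),
    "categorical": pd.Series(["a", "b", "a"], dtype="category"),
    "string": pd.Series(["a", None], dtype="string"),
    "nullable integer": pd.Series([1, None], dtype="Int64"),
    "nullable boolean": pd.Series([True, None], dtype="boolean"),
}

# Print the dtype pandas reports for each Series.
for name, series in examples.items():
    print(f"{name:>18}: {series.dtype}")
```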

Previously, pandas couldn't natively store an array of integers with some missing values, since we used `np.nan` as a missing value indicator. `np.nan` is a float, and an array of some integers and some floats is just cast to an array of floats.

In [None]:
pd.Series([1, None, 2])

You might have explicitly requested `dtype=object` to keep it from being cast to float.

In [None]:
pd.Series([1, None, 2], dtype=object)

But as we know, object dtype is slow. It's better to use pandas' nullable integer dtype:

In [None]:
pd.Series([1, None, 2], dtype=pd.Int64Dtype())

In [None]:
a = pd.Series([None] + list(range(10000)), dtype=object)
%timeit a.mean()

In [None]:
b = pd.Series([None] + list(range(10000)), dtype=pd.Int64Dtype())
%timeit b.mean()

If you have some messy data in-memory (in a Python list, say) that you wish to convert, use one of pandas' parsers like

- `pd.to_numeric`
- `pd.to_datetime`
- `pd.to_timedelta`
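A minimal sketch of two of these parsers on made-up messy input follows. Passing `errors="coerce"` is one possible policy: it turns unparseable entries into missing values instead of raising, which may or may not be what you want.

```python
import pandas as pd

# Messy in-memory data held as object dtype.
raw_numbers = pd.Series(["1", "2", "n/a", "4.5"], dtype=object)
raw_dates = pd.Series(["2024-01-01", "not a date"], dtype=object)

# errors="coerce" replaces entries that cannot be parsed with NaN / NaT.
numbers = pd.to_numeric(raw_numbers, errors="coerce")
dates = pd.to_datetime(raw_dates, errors="coerce")

print(numbers.dtype, int(numbers.isna().sum()))
print(dates.dtype, int(dates.isna().sum()))
```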

### Categoricals

pandas has a [Categorical Data Type](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html) that represents data coming from a specific, and typically *fixed*, set of values. Its data model includes two components:

1. `categories`: An Index storing the set of valid values.
2. `ordered`: A boolean flag indicating whether there's an ordering between the values.

If you try to set a value that's not contained in the `categories`, you'll see an exception.

In [None]:
a = pd.Categorical(['good', 'better', 'good', 'best'],
                   categories=['good', 'better', 'best'],
                   ordered=True)
a

In [None]:
a[0] = 'OK'
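If you do need a new value, one option is to extend the categories first. The sketch below catches the exception from the assignment above (both `TypeError` and `ValueError` are caught, since the exact exception class is an assumption here) and then uses `add_categories`, which returns a new Categorical with the extra category appended at the end of the ordering.

```python
import pandas as pd

a = pd.Categorical(['good', 'better', 'good', 'best'],
                   categories=['good', 'better', 'best'],
                   ordered=True)

# Assigning a value outside the categories fails.
try:
    a[0] = 'OK'
    raised = False
except (TypeError, ValueError) as exc:
    raised = True
    print("assignment failed:", exc)

# Extend the categories, then assign.
b = a.add_categories(['OK'])
b[0] = 'OK'
print(b)
```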

While the primary intent of categoricals was to model data with a fixed set of allowed values, their implementation suggests another use: saving memory and improving performance.

When you make a Categorical, pandas will *factorize* the values. This creates a mapping between the original value and an integer code.

In [None]:
a

In [None]:
a.categories

In [None]:
a.codes

Notice two things:

1. The value `"good"` is associated with code `0`, the first element in `.categories`.
2. The codes are stored as `int8`.
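The same value-to-code mapping can be sketched directly with `pd.factorize`, which returns the integer codes and the unique values in order of first appearance. Indexing the uniques with the codes recovers the original data. Note that `pd.factorize` picks its own integer dtype for the codes, so the `int8` observation above is specific to `Categorical`.

```python
import numpy as np
import pandas as pd

values = np.array(["good", "better", "good", "best"], dtype=object)

# codes[i] is the position of values[i] within uniques.
codes, uniques = pd.factorize(values)
print(codes)
print(uniques)

# Indexing uniques by codes reconstructs the original values.
restored = uniques[codes]
print(restored)
```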

When you have many repeated values whose regular representation is larger than int8, storing the data as a Categorical can have some performance benefits.

To demonstrate this, let's suppose you had a table with every resident of the United States (about 321,000,000 rows) where one column stores the state abbreviation as a string.

Just storing this as an object-dtype array would cost the size of the 2-character string times the 321,000,000 occurrences:

In [None]:
import sys

population = 321_000_000
bytes_per = sys.getsizeof("AL")  # two characters per state
print("{:,d} MB".format((population * bytes_per // 1_000_000)))

Or about 16GB. How many MB would you need to store the same as a Categorical?

In [None]:
%load solutions/performance_categorical.py

So when there are many repeated values the savings can be dramatic.
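On a smaller scale, you can measure the difference directly with `Series.memory_usage(deep=True)`. This is only a sketch: the six state abbreviations and the 100,000-row size are arbitrary choices, and the exact byte counts will vary by platform and pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
states = ["AL", "AK", "AZ", "CA", "NY", "TX"]  # a small, arbitrary subset

as_object = pd.Series(rng.choice(states, size=100_000), dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects behind object dtype.
obj_bytes = as_object.memory_usage(index=False, deep=True)
cat_bytes = as_category.memory_usage(index=False, deep=True)

print(f"object:   {obj_bytes:>12,d} bytes")
print(f"category: {cat_bytes:>12,d} bytes")
```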

## Vectorization

Just like with NumPy, *vectorization* is key to getting good performance out of pandas.
The short definition is "don't do for loops"; the longer definition is "let NumPy do the for loop in C".

As an example, let's grab some data on airport locations. We'll compute the distances between pairs of airports.
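As a toy illustration of the two definitions, the sketch below computes the same result twice: once with a Python-level loop over the elements, and once as a single array expression that NumPy evaluates in compiled code. The expression `x * 2 + 1` is arbitrary.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100_000, dtype=np.float64))

# Python-level loop: every element passes through the interpreter.
looped = pd.Series([x * 2 + 1 for x in s], index=s.index)

# Vectorized: one expression, the loop runs inside NumPy.
vectorized = s * 2 + 1

print(looped.equals(vectorized))
```

Both give the same values; wrapping each version in `%timeit` in a notebook cell shows the difference in speed.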

In [None]:
airports = pd.read_csv(
    "https://vega.github.io/vega-datasets/data/airports.csv",
    index_col="iata",
    nrows=500,
)
airports

This next block is a bit of pandas magic to build a DataFrame with each pair.

In [None]:
columns = ["longitude", "latitude"]
idx = pd.MultiIndex.from_product([airports.index, airports.index],
                                 names=['orig', 'dest'])
subset = idx.get_level_values(0) > idx.get_level_values(1)

pairs = pd.concat([
    airports[columns]
        .add_suffix('_orig')
        .reindex(idx, level='orig'),
    airports[columns]
        .add_suffix('_dest')
        .reindex(idx, level='dest')
    ], axis="columns"
)[subset]
pairs
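To see what the index construction above does, here is the same idea on three made-up labels. `MultiIndex.from_product` builds every ordered pair, and the `>` comparison keeps only one of each unordered pair while dropping self-pairs.

```python
import pandas as pd

labels = pd.Index(["A", "B", "C"])
idx = pd.MultiIndex.from_product([labels, labels], names=["orig", "dest"])

# Keep pairs where orig sorts after dest: drops (A, A) and one of (A, B) / (B, A).
keep = idx.get_level_values("orig") > idx.get_level_values("dest")
kept = idx[keep]

print(len(idx), "ordered pairs ->", len(kept), "unique pairs")
print(list(kept))
```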

In [None]:
import math


def gcd_py(lat1, lng1, lat2, lng2):
    '''
    Calculate great circle distance between two points.
    https://www.johndcook.com/blog/python_longitude_latitude/

    Parameters
    ----------
    lat1, lng1, lat2, lng2: float

    Returns
    -------
    distance:
      distance from ``(lat1, lng1)`` to ``(lat2, lng2)`` in kilometers.
    '''
    degrees_to_radians = math.pi / 180.0
    ϕ1 = (90 - lat1) * degrees_to_radians
    ϕ2 = (90 - lat2) * degrees_to_radians

    θ1 = lng1 * degrees_to_radians
    θ2 = lng2 * degrees_to_radians

    cos = (math.sin(ϕ1) * math.sin(ϕ2) * math.cos(θ1 - θ2) +
           math.cos(ϕ1) * math.cos(ϕ2))
    # round to avoid precision issues on identical points causing ValueErrors
    cos = round(cos, 8)
    arc = math.acos(cos)
    return arc * 6373  # radius of earth, in kilometers

In [None]:
def gcd_vec(lat1, lng1, lat2, lng2):
    '''
    Calculate great circle distance.
    https://www.johndcook.com/blog/python_longitude_latitude/

    Parameters
    ----------
    lat1, lng1, lat2, lng2: float or array of float

    Returns
    -------
    distance:
      distance from ``(lat1, lng1)`` to ``(lat2, lng2)`` in kilometers.
    '''
    ϕ1 = np.deg2rad(90 - lat1)
    ϕ2 = np.deg2rad(90 - lat2)

    θ1 = np.deg2rad(lng1)
    θ2 = np.deg2rad(lng2)

    cos = (np.sin(ϕ1) * np.sin(ϕ2) * np.cos(θ1 - θ2) +
           np.cos(ϕ1) * np.cos(ϕ2))
    # round to avoid precision issues on identical points causing warnings
    cos = np.round(cos, 8)
    arc = np.arccos(cos)
    return arc * 6373 # radius of earth, in kilometers

In [None]:
%%time
# gcd_py with DataFrame.apply
r = pairs.apply(
    lambda x: gcd_py(x['latitude_orig'],
                     x['longitude_orig'],
                     x['latitude_dest'],
                     x['longitude_dest']),
                axis="columns"
);

In [None]:
%%time
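# gcd_py with manual iteration
# Select columns in the (lat, lng, lat, lng) order that gcd_py expects;
# the column order of `pairs` itself is (lng, lat, lng, lat).
ordered = pairs[['latitude_orig', 'longitude_orig',
                 'latitude_dest', 'longitude_dest']]
_ = pd.Series([gcd_py(*x) for x in ordered.itertuples(index=False)],
              index=pairs.index)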

In [None]:
%%time
# gcd_vec
r = gcd_vec(pairs['latitude_orig'], pairs['longitude_orig'],
            pairs['latitude_dest'], pairs['longitude_dest'])
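When rewriting a loop as a vectorized function, it is worth checking that the two agree. The sketch below is self-contained, so it restates the formula from above in two compact helpers, `gcd_scalar` and `gcd_array`. These names are hypothetical stand-ins for `gcd_py` and `gcd_vec`, and the random coordinates are made up. It then compares the two results within a tolerance of one metre.

```python
import math

import numpy as np

EARTH_RADIUS_KM = 6373  # same radius as used above


def gcd_scalar(lat1, lng1, lat2, lng2):
    # Scalar version: spherical law of cosines on colatitudes.
    p1, p2 = math.radians(90 - lat1), math.radians(90 - lat2)
    t1, t2 = math.radians(lng1), math.radians(lng2)
    cos = (math.sin(p1) * math.sin(p2) * math.cos(t1 - t2)
           + math.cos(p1) * math.cos(p2))
    return math.acos(round(cos, 8)) * EARTH_RADIUS_KM


def gcd_array(lat1, lng1, lat2, lng2):
    # Same formula, operating on whole arrays at once.
    p1, p2 = np.deg2rad(90 - lat1), np.deg2rad(90 - lat2)
    t1, t2 = np.deg2rad(lng1), np.deg2rad(lng2)
    cos = np.sin(p1) * np.sin(p2) * np.cos(t1 - t2) + np.cos(p1) * np.cos(p2)
    return np.arccos(np.round(cos, 8)) * EARTH_RADIUS_KM


rng = np.random.default_rng(0)
lat1, lat2 = rng.uniform(-90, 90, (2, 1000))
lng1, lng2 = rng.uniform(-180, 180, (2, 1000))

looped = np.array([gcd_scalar(*row) for row in zip(lat1, lng1, lat2, lng2)])
vectorized = gcd_array(lat1, lng1, lat2, lng2)

# Agreement within 1e-3 km (one metre).
print(np.allclose(looped, vectorized, atol=1e-3))
```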

For more, consult these pages from pandas' documentation:

* Enhancing performance: https://pandas.pydata.org/docs/user_guide/enhancingperf.html
* Scaling to larger datasets: https://pandas.pydata.org/docs/user_guide/scale.html