# Timestamp Decode Speed Test

Traffic counter files contain rows of timestamps encoded in the format `DD-MMM-YYYY HH:mm:ss`, e.g. `01-Jan-2011 00:00:00`.  By default, pandas timestamp parsing is automatic and comprehensive, but slow.  This notebook documents my attempts to speed up that parsing using the packages [pendulum](https://pendulum.eustace.io/docs/) and [ciso8601](https://github.com/closeio/ciso8601).

TL;DR: don't bother.  Just use `pd.to_datetime(..., infer_datetime_format=True)`.

In [1]:
import pandas as pd
import numpy as np
import ciso8601
import pendulum

## Read in data

In [4]:
df = pd.read_csv('../20190919_intermediate_file_io/20050591_17636_2011.txt',
                 sep='\t', header=None, usecols=[3, 4])
df.columns = ['Timestamp', 'Count']
df['Timestamp_datetime'] = pd.to_datetime(df['Timestamp'], infer_datetime_format=True)
df['Timestamp_isostring'] = df['Timestamp_datetime'].apply(
    lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))

## Single ISO8601 timestamp conversion test

Our timestamps are not in ISO8601 (`YYYY-MM-DD HH:mm:ss`) format, but parsing a single ISO8601 string can serve as a baseline.

In [6]:
%timeit pd.to_datetime('2014-12-05 12:30:45')

124 µs ± 8.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [8]:
%timeit pendulum.parse('2014-12-05 12:30:45')

7.22 µs ± 407 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [9]:
%timeit ciso8601.parse_datetime('2014-12-05 12:30:45')

129 ns ± 7.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)


`ciso8601` clearly wins this at about 129 ns per call, but `pendulum` (about 7 µs) is also far faster than pandas (about 124 µs) on a single string.
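To repeat this kind of single-string comparison outside a notebook, where `%timeit` is unavailable, the standard-library `timeit` module can be used. The sketch below is only an illustration: `time_parser` is a hypothetical helper, and `datetime.fromisoformat` is a stand-in parser that was not part of the measurements above. It also reports the best run rather than the mean ± std. dev. that `%timeit` prints, so the numbers are not directly comparable.

```python
import timeit
from datetime import datetime


def time_parser(parser, text, number=10_000, repeat=7):
    """Return the best per-call time in seconds over `repeat` runs.

    `parser` is any callable that takes a single string, for example
    pd.to_datetime, pendulum.parse, or ciso8601.parse_datetime.
    """
    timings = timeit.repeat(lambda: parser(text), number=number, repeat=repeat)
    # Each entry in `timings` is the total time for `number` calls.
    return min(timings) / number


# Stand-in parser from the standard library (an assumption, not benchmarked above).
per_call = time_parser(datetime.fromisoformat, '2014-12-05 12:30:45')
print(f"{per_call * 1e9:.0f} ns per call")
```

Swapping in one of the parsers tested above only requires passing a different callable as the first argument.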

## Array of ISO8601 timestamps conversion test

In [10]:
%timeit pd.to_datetime(df['Timestamp_isostring'].values)

5.66 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [11]:
%timeit df['Timestamp_isostring'].apply(ciso8601.parse_datetime)

8.21 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [12]:
%timeit df['Timestamp_isostring'].apply(pendulum.parse)

AttributeError: 'Pendulum' object has no attribute 'nanosecond'

In [14]:
# Okay, that didn't work.  What if we explicitly defined an apply function?

def parse_with_pendulum(text):
    return np.datetime64(pendulum.parse(text, tz=None))

%timeit df['Timestamp_isostring'].apply(parse_with_pendulum)

1.58 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


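So on a whole column the single-string ranking reverses: vectorized `pd.to_datetime` (5.66 ms) beats `ciso8601.parse_datetime` applied row by row (8.21 ms), and pendulum is far slower still (1.58 s).  The lack of vectorization in `ciso8601.parse_datetime` (I couldn't find any resources on how `ciso8601` can support vectorization) and the [general incompatibility of](https://stackoverflow.com/questions/47849342/making-pandas-work-with-pendulum) [Pendulum with pandas](https://github.com/sdispater/pendulum/issues/131) make both packages difficult to use here.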

Indeed, I had to do something like:

```python
import re  # needed for re.compile below; np and pendulum are imported at the top

monthno = {"Jan": "01",
           "Feb": "02",
           "Mar": "03",
           "Apr": "04",
           "May": "05",
           "Jun": "06",
           "Jul": "07",
           "Aug": "08",
           "Sep": "09",
           "Oct": "10",
           "Nov": "11",
           "Dec": "12"}

monthno_comp = {re.compile(k): v for k, v in monthno.items()}


def parse_text(text):
    for pattern, replacement in monthno_comp.items():
        text = pattern.sub(replacement, text)
    return np.datetime64(pendulum.from_format(text, 'DD-MM-YYYY HH:mm:ss', tz=None))
```

to convert Arman's timestamps to something that could be parsed by pendulum.
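The month substitution above does not strictly need regular expressions, since the timestamp layout is fixed. As a minimal sketch, assuming every row follows `DD-MMM-YYYY HH:mm:ss` with English month abbreviations, the string can be split and reassembled directly into ISO8601 order, which is the layout `ciso8601.parse_datetime` was tested on earlier. The names `MONTH_NUMBER` and `to_iso_string` are hypothetical, and `datetime.fromisoformat` is used here only to check the result.

```python
from datetime import datetime

# English month abbreviations mapped to zero-padded month numbers.
MONTH_NUMBER = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04",
                "May": "05", "Jun": "06", "Jul": "07", "Aug": "08",
                "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def to_iso_string(text):
    """Convert 'DD-MMM-YYYY HH:mm:ss' to 'YYYY-MM-DD HH:mm:ss'."""
    # '01-Jan-2011 00:00:00' -> '01', 'Jan', '2011 00:00:00'
    day, month, rest = text.split("-", 2)
    # '2011 00:00:00' -> '2011', '00:00:00'
    year, time_part = rest.split(" ", 1)
    return f"{year}-{MONTH_NUMBER[month]}-{day} {time_part}"


iso = to_iso_string("01-Jan-2011 00:00:00")
print(iso)
print(datetime.fromisoformat(iso))
```

An unknown month abbreviation raises a `KeyError` here rather than passing through silently, which seems preferable for data that is expected to be uniform.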

## Array of Matlab Timestamps with pandas

In [15]:
%timeit pd.to_datetime(df['Timestamp'])

6.14 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
%timeit pd.to_datetime(df['Timestamp'], infer_datetime_format=True)

258 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Okay, just use `infer_datetime_format=True` for fast handling of uniform-format timestamps: it cuts parsing time from 6.14 s to 258 ms, roughly a 24x speedup.
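Since the format is known in advance, another option is to skip inference and pass it explicitly through the `format` argument of `pd.to_datetime`. This was not timed in this notebook, so no speed claim is made. The sketch below only shows the format string matching `DD-MMM-YYYY HH:mm:ss`, with a few made-up sample rows and the assumption that month abbreviations are English. Note also that, as far as I know, `infer_datetime_format` is deprecated from pandas 2.0 onward, where a strict version of that inference is the default, so an explicit `format` is the more future-proof spelling.

```python
import pandas as pd

# Made-up sample rows in the traffic counter layout (not from the real file).
raw = pd.Series(['01-Jan-2011 00:00:00',
                 '01-Jan-2011 00:15:00',
                 '31-Dec-2011 23:45:00'])

# %d = day, %b = abbreviated month name, %Y = four-digit year.
parsed = pd.to_datetime(raw, format='%d-%b-%Y %H:%M:%S')
print(parsed)
```

With an explicit format, a row that does not match raises an error instead of being parsed with a different guess, which doubles as a consistency check on the input file.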