# Refactoring the code to improve speed

In [1]:
import pandas as pd
from jupyterworkflow.data import get_freemont_data

data = get_freemont_data()

- parsing all the date strings (there are a lot of them!) is the main factor behind our slow code. You can check this by removing the last argument of the 'read_csv' call.

In [2]:
data = pd.read_csv('freemont.csv', index_col='Date')
data.index.dtype

dtype('O')

- the problem is that, by default, our index holds objects (strings) instead of dates

In [3]:
data = pd.read_csv('freemont.csv', index_col='Date', parse_dates=True)
data.index.dtype

dtype('<M8[ns]')

- we still need to parse the dates, but in a faster way than 'parse_dates=True'
- the pandas built-in 'to_datetime' function yields the same result, with the same speed problem

In [4]:
data = pd.read_csv('freemont.csv', index_col='Date')
pd.to_datetime(data.index)

DatetimeIndex(['2012-10-03 00:00:00', '2012-10-03 01:00:00',
               '2012-10-03 02:00:00', '2012-10-03 03:00:00',
               '2012-10-03 04:00:00', '2012-10-03 05:00:00',
               '2012-10-03 06:00:00', '2012-10-03 07:00:00',
               '2012-10-03 08:00:00', '2012-10-03 09:00:00',
               ...
               '2017-06-30 14:00:00', '2017-06-30 15:00:00',
               '2017-06-30 16:00:00', '2017-06-30 17:00:00',
               '2017-06-30 18:00:00', '2017-06-30 19:00:00',
               '2017-06-30 20:00:00', '2017-06-30 21:00:00',
               '2017-06-30 22:00:00', '2017-06-30 23:00:00'],
              dtype='datetime64[ns]', name='Date', length=41568, freq=None)

- here is the raw format of the dates as stored in the file

In [5]:
data = pd.read_csv('freemont.csv', index_col='Date')
data.index

Index(['10/03/2012 12:00:00 AM', '10/03/2012 01:00:00 AM',
       '10/03/2012 02:00:00 AM', '10/03/2012 03:00:00 AM',
       '10/03/2012 04:00:00 AM', '10/03/2012 05:00:00 AM',
       '10/03/2012 06:00:00 AM', '10/03/2012 07:00:00 AM',
       '10/03/2012 08:00:00 AM', '10/03/2012 09:00:00 AM',
       ...
       '06/30/2017 02:00:00 PM', '06/30/2017 03:00:00 PM',
       '06/30/2017 04:00:00 PM', '06/30/2017 05:00:00 PM',
       '06/30/2017 06:00:00 PM', '06/30/2017 07:00:00 PM',
       '06/30/2017 08:00:00 PM', '06/30/2017 09:00:00 PM',
       '06/30/2017 10:00:00 PM', '06/30/2017 11:00:00 PM'],
      dtype='object', name='Date', length=41568)

- we can speed things up by passing the format explicitly, using Python's 'strftime' directives. Parsing then becomes almost instantaneous
- if the format changes partway through the data, the explicit format will fail. We therefore wrap the fast path in a try/except and fall back to the slower, inferred parsing if an exception is raised

In [6]:
data = pd.read_csv('freemont.csv', index_col='Date')

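try:
    # %I is the 12-hour clock; with %H the AM/PM marker (%p) would be ignored
    data.index = pd.to_datetime(data.index, format='%m/%d/%Y %I:%M:%S %p')
except (TypeError, ValueError):
    # format mismatch: fall back to the slower, inferred parsing
    data.index = pd.to_datetime(data.index)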
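- to check the speed claim without the real file, here is a rough timing sketch on synthetic data. The hourly timestamps are generated, not taken from 'freemont.csv', and the format uses '%I' because the strings are on a 12-hour clock. Timings depend on the machine and on the pandas version (recent versions already guess the format from the first value, so the gap may be small), so no numbers are claimed here

```python
import time
import pandas as pd

# Synthetic stand-in for the 'Date' column: hourly stamps, same layout as the file.
stamps = pd.date_range('2012-10-03', periods=41568, freq=pd.Timedelta(hours=1))
strings = stamps.strftime('%m/%d/%Y %I:%M:%S %p')

# Slow path: let pandas work out the format on its own.
start = time.perf_counter()
inferred = pd.to_datetime(strings)
t_inferred = time.perf_counter() - start

# Fast path: tell pandas the format up front.
start = time.perf_counter()
explicit = pd.to_datetime(strings, format='%m/%d/%Y %I:%M:%S %p')
t_explicit = time.perf_counter() - start

print(f'inferred: {t_inferred:.3f} s, explicit: {t_explicit:.3f} s')

# Both paths should agree on every timestamp.
assert (inferred == explicit).all()
```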

- put those last lines in the data.py file and re-run the unit tests to see how much faster things have become
- unit tests also help when refactoring for speed: they confirm that the faster code still returns the same result
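
- a minimal sketch of such a test is given below. 'load_data' is a hypothetical stand-in for the loader in data.py, and the two-row CSV with its 'West' and 'East' columns is made up for the example; the real test would call 'get_freemont_data' instead

```python
import io
import pandas as pd

# Made-up two-row extract; column names are placeholders, not the real schema.
SAMPLE_CSV = """Date,West,East
10/03/2012 12:00:00 AM,4,9
10/03/2012 01:00:00 PM,10,6
"""

def load_data(source):
    """Hypothetical stand-in for the loader in data.py."""
    data = pd.read_csv(source, index_col='Date')
    try:
        data.index = pd.to_datetime(data.index, format='%m/%d/%Y %I:%M:%S %p')
    except (TypeError, ValueError):
        data.index = pd.to_datetime(data.index)
    return data

def test_load_data():
    data = load_data(io.StringIO(SAMPLE_CSV))
    # The index must be real datetimes, not strings.
    assert isinstance(data.index, pd.DatetimeIndex)
    # 12 AM is midnight and 01 PM is 13:00, so AM/PM must be honoured.
    assert list(data.index.hour) == [0, 13]

test_load_data()
```

Checking the hours, and not just the dtype, catches a wrong format string that still parses without raising.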