# SimFin Tutorial 06 - Performance Tips

[Original repository on GitHub](https://github.com/simfin/simfin-tutorials)

This tutorial was originally written by [Hvass Labs](https://github.com/Hvass-Labs)

----

"Are you employed, Sir? You don't go out looking for a job dressed like that on a week-day, do you? Is this a ... what day is this?" &ndash; [The Big Lebowski](https://www.youtube.com/watch?v=xJjCnWm5cvE)

## Introduction

This is a collection of tips on how to improve performance when using the simfin package. It is assumed you are already familiar with the previous tutorials on the basics of simfin.

## Imports

In [1]:
%matplotlib inline
import pandas as pd

# Import the main functionality from the SimFin Python API.
import simfin as sf

# Import names used for easy access to SimFin's data-columns.
from simfin.names import *

In [2]:
# Version of the SimFin Python API.
sf.__version__

'0.3.0'

## Config

In [3]:
# SimFin data-directory.
sf.set_data_dir('~/simfin_data/')

In [4]:
# SimFin load API key or use free data.
sf.load_api_key(path='~/simfin_api_key.txt', default_key='free')

## Load Datasets

In these examples, we will use the following datasets:

In [5]:
%%time
# Data for USA.
market = 'us'

# Daily Share-Prices.
df_prices = sf.load_shareprices(variant='daily', market=market)

Dataset "us-shareprices-daily" on disk (2 days old).
- Loading from disk ... Done!
CPU times: user 15.6 s, sys: 1.26 s, total: 16.9 s
Wall time: 15.4 s


In [6]:
df_prices.head()

| Ticker | Date | SimFinId | Open | Low | High | Close | Adj. Close | Dividend | Volume |
|---|---|---|---|---|---|---|---|---|---|
| A | 2007-01-03 | 45846 | 34.99 | 34.05 | 35.48 | 34.3 | 22.95 | NaN | 2574600 |
| A | 2007-01-04 | 45846 | 34.3 | 33.46 | 34.6 | 34.41 | 23.03 | NaN | 2073700 |
| A | 2007-01-05 | 45846 | 34.3 | 34.0 | 34.4 | 34.09 | 22.81 | NaN | 2676600 |
| A | 2007-01-08 | 45846 | 33.98 | 33.68 | 34.08 | 33.97 | 22.73 | NaN | 1557200 |
| A | 2007-01-09 | 45846 | 34.08 | 33.63 | 34.32 | 34.01 | 22.76 | NaN | 1386200 |


## Disk Cache

Some functions take a long time to process data, such as the signal-functions in the simfin package. Whenever you rerun a Notebook, you would have to rerun all these slow functions again, even though the results would be exactly the same if the data has not changed.

A simple solution is to cache the results of slow functions by writing them to a cache-file on disk. The next time the function is called, it automatically checks whether a recent cache-file exists on disk and loads it if so. Otherwise the slow function is computed and the results are saved in the cache-file for future use.

This is implemented by using a so-called decorator or wrapper-function `@sf.cache` on the slow function. This is used in simfin's signal-functions, and you can also use this wrapper on your own functions (see below).
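
To make the mechanism concrete, here is a minimal sketch of how such a disk-cache decorator could work, using only the Python standard library. This is not simfin's actual implementation: the name `disk_cache`, the extra `cache_dir` argument and the function `slow_square` are hypothetical and only serve as illustration.

```python
import functools
import os
import pickle
import tempfile
import time


def disk_cache(func):
    """Hypothetical stand-in for a disk-cache decorator (NOT simfin's code)."""

    @functools.wraps(func)
    def wrapper(cache_name=None, cache_refresh_days=1, cache_dir=None, **kwargs):
        # Without a cache-name, caching is disabled and the function runs as normal.
        if cache_name is None:
            return func(**kwargs)

        # Assumption: cache-files are stored in a temp-dir unless told otherwise.
        cache_dir = cache_dir or tempfile.gettempdir()
        path = os.path.join(cache_dir, f'{func.__name__}-{cache_name}.pickle')

        # Reuse the cache-file only if it exists and is recent enough.
        if os.path.exists(path):
            age_days = (time.time() - os.path.getmtime(path)) / 86400
            if age_days < cache_refresh_days:
                with open(path, 'rb') as file:
                    return pickle.load(file)

        # Otherwise compute the result and save it for future use.
        result = func(**kwargs)
        with open(path, 'wb') as file:
            pickle.dump(result, file)
        return result

    return wrapper


@disk_cache
def slow_square(x):
    time.sleep(0.1)  # Pretend this is an expensive computation.
    return x * x


cache_dir = tempfile.mkdtemp()
first = slow_square(x=4, cache_name='demo', cache_dir=cache_dir)   # Computed and saved.
second = slow_square(x=4, cache_name='demo', cache_dir=cache_dir)  # Loaded from disk.
```

Note that the sketch only looks at the age of the cache-file, never at the input data, and that the wrapped function can only receive keyword arguments.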

A few things should be noted:

1. The wrapper adds three more arguments to the original function: `cache_name`, which allows you to distinguish cache-files from each other; `cache_refresh_days`, which sets the number of days before the slow function is recomputed and the results are saved to the cache-file again; and `cache_format`, which sets the file-format of the cache-files.

2. Because of these new arguments, you **MUST** use keyword arguments when calling the wrapped function, otherwise the arguments will get passed to the cache-wrapper instead of the original function. This will raise a strange exception.

3. By default `cache_format='pickle'` so the cache-files are saved as uncompressed pickle-files, which are very fast to save and load, but also take a lot of disk-space. You may compress the pickle-files using `cache_format='pickle.gz'` which can compress DataFrames with much repetitive data (e.g. forward-filled daily signals) by a factor of 100 or more, but this requires a little more computation time. Other file-formats such as `'parquet'` and `'feather'` are also supported, but these have some restrictions on the DataFrames they can save.

4. The cache-wrapper cannot detect whether the original data being processed has changed since the function was last computed. It can only check how old the cache-file is on disk, and compare that to the argument `cache_refresh_days` to decide whether to recompute the slow function. So you may need to manually force a cache-refresh by passing the argument `cache_refresh_days=0` if you suspect the cached result was computed using older data.

5. One way of ensuring you are always using fresh data and the signals have been computed using the newest data, is to schedule a script that downloads new data from the SimFin server every day, and then computes the slow signal-functions afterwards. You do this by passing the argument `refresh_days=0` to all the `sf.load()` functions and `cache_refresh_days=0` to the signal-functions. When you load the data and signals in your Notebooks for further analysis, you pass the arguments `refresh_days=1` and `cache_refresh_days=1` so the Notebook uses the data and cache-files from disk.
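
The trade-off between the `'pickle'` and `'pickle.gz'` formats in point 3 can be illustrated with the standard library alone. The sketch below assumes that a compressed pickle is simply a gzip-compressed pickle byte-stream, and it uses a made-up list of repeated numbers instead of a real DataFrame, so the compression factor it prints is only indicative.

```python
import gzip
import pickle

# Assumption: highly repetitive made-up data, loosely mimicking
# forward-filled daily signals where values repeat for long stretches.
data = [1.2345] * 50_000 + [6.789] * 50_000

# Uncompressed pickle versus the same bytes compressed with gzip.
raw = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
compressed = gzip.compress(raw)

print(f'Raw pickle:     {len(raw):>9,} bytes')
print(f'Gzipped pickle: {len(compressed):>9,} bytes')
print(f'Compression factor: {len(raw) / len(compressed):.0f}x')

# Round-trip check: decompressing and unpickling recovers the data.
restored = pickle.loads(gzip.decompress(compressed))
```

Data with little repetition will compress far less, so it is worth measuring on your own cache-files before switching formats.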

### Caching a SimFin Function

Here is an example of a function from the simfin package for calculating share-price signals. This takes roughly 35 seconds to compute in the run below:

In [7]:
%%time
df_price_signals = sf.price_signals(df_prices=df_prices)

CPU times: user 36.5 s, sys: 625 ms, total: 37.1 s
Wall time: 34.1 s


The function `sf.price_signals` is actually wrapped with `@sf.cache` so the caching-feature is automatically enabled if we pass the argument `cache_name`.

In [8]:
# Name for the cache e.g. 'us-all'
cache_name = market + '-all'

# Refresh the cache once a day.
cache_refresh_days = 1

In [9]:
%%time
df_price_signals2 = \
    sf.price_signals(df_prices=df_prices,
                     cache_name=cache_name,
                     cache_refresh_days=cache_refresh_days)

Cache-file 'price_signals-us-all.pickle' not on disk.
- Running function price_signals() ... Done!
- Saving cache-file to disk ... Done!
CPU times: user 37.7 s, sys: 804 ms, total: 38.5 s
Wall time: 35.4 s


The first time the function is called, it will compute the signals and save the resulting DataFrame to a cache-file on disk. When the function is called again, the cached DataFrame will be loaded instead. When the cache-file is too old, the function is called again and a new cache-file is saved to disk.

Note that the cache-file is named `price_signals-us-all.pickle` which is constructed from the function's name `price_signals`, the cache-name we have supplied `us-all`, and the file-extension `.pickle`. This keeps the cache-files neatly organized on disk, while still allowing us to designate different cache-names for different calls of the same function, for example if we want to process different markets or stocks.
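
The naming pattern can be sketched as follows. The helper `cache_filename` is hypothetical and not part of the simfin package; it only mirrors the pattern described above, and the second cache-name `'de-all'` is just an example of a different name.

```python
def cache_filename(func_name, cache_name, cache_format='pickle'):
    """Hypothetical helper that mirrors the cache-file naming pattern."""
    return f'{func_name}-{cache_name}.{cache_format}'


market = 'us'
name_us = cache_filename('price_signals', market + '-all')
print(name_us)  # price_signals-us-all.pickle

# A different cache-name gives a separate cache-file for the same function.
name_de = cache_filename('price_signals', 'de-all')
print(name_de)  # price_signals-de-all.pickle
```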

If you want to pass the same cache-arguments to several functions, then it is more convenient to create a dict with the arguments:

In [10]:
cache_args = {'cache_name': cache_name,
              'cache_refresh_days' : cache_refresh_days}

In [11]:
%%time
df_price_signals3 = \
    sf.price_signals(df_prices=df_prices, **cache_args)

Cache-file 'price_signals-us-all.pickle' on disk (0 days old).
- Loading cache-file from disk ... Done!
CPU times: user 78.7 ms, sys: 140 ms, total: 219 ms
Wall time: 217 ms


We can check that the results are all identical:

In [12]:
df_price_signals.equals(df_price_signals2)

True

In [13]:
df_price_signals.equals(df_price_signals3)

True

### Caching Your Own Functions

You can also use the caching-feature on your own functions simply by adding the decorator `@sf.cache` to your function declaration. The default `'pickle'` file-format should support all Pandas DataFrames and Series and properly save all meta-data, such as which columns are used as indices.

The Parquet file-format, however, only supports Pandas DataFrames (not Series), and the column-names must start with a letter. There may be other requirements imposed by the Parquet file-format, and you will get an exception if you violate them. The Feather file-format is even more basic and cannot save DataFrames with a MultiIndex. So it is best to use the default pickle-format or the compressed pickle-format.

Here is an example of a function that calculates the sum of each row. Because this results in a Pandas Series without a name, if we were using Parquet as the cache-format, then we would have to call `.to_frame(name='Sum')` to convert it into a Pandas DataFrame with a valid name, so it could be saved in the Parquet format. But we are using the default pickle-format, which can save it as is:

In [14]:
@sf.cache
def my_function(df):
    return df.sum(axis=1)

Remember that you **MUST** call `my_function()` with named arguments! Otherwise you will get a strange exception. The reason is that the decorator has actually created a new function which takes the arguments: `cache_name`, `cache_refresh_days`, `cache_format` and `**kwargs`, as we can see from this slightly cryptic specification of the function:

In [15]:
import inspect
inspect.getfullargspec(my_function)

FullArgSpec(args=['cache_name', 'cache_refresh_days', 'cache_format'], varargs=None, varkw='kwargs', defaults=(None, 1, 'pickle'), kwonlyargs=[], kwonlydefaults=None, annotations={})

So if you call `my_function()` with unnamed arguments, they are bound to `cache_name`, `cache_refresh_days` and `cache_format` in that order, while only keyword arguments are passed on to the original function. In the call below, `df_prices` is therefore taken as the `cache_name`, which raises a strange exception:

In [16]:
try:
    df_result = my_function(df_prices)
except Exception as e:
    print(e)

ufunc 'add' did not contain a loop with signature matching types dtype('<U21') dtype('<U21') dtype('<U21')
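
The following toy example shows why positional arguments go wrong. The wrapper `toy_cache` and the function `total` are hypothetical; the wrapper merely copies the argument layout reported by `inspect.getfullargspec` above and does no caching at all.

```python
def toy_cache(func):
    """Hypothetical wrapper with the same argument layout as the cache-wrapper."""

    def wrapper(cache_name=None, cache_refresh_days=1, cache_format='pickle',
                **kwargs):
        # Report what the wrapper received, then call the original function
        # with the keyword arguments only.
        print(f'cache_name={cache_name!r}, kwargs={kwargs}')
        return func(**kwargs)

    return wrapper


@toy_cache
def total(values):
    return sum(values)


# Keyword argument: reaches the original function as intended.
result = total(values=[1, 2, 3])

# Positional argument: bound to cache_name, so `values` is never passed on
# and the original function fails because its argument is missing.
try:
    total([1, 2, 3])
except TypeError as e:
    print('TypeError:', e)
```

In this toy version the failure is a plain `TypeError`. With a real cache-wrapper the misplaced argument may be used for other things first, such as building a file-name, which is presumably why the exception above looks so unrelated.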


So you **MUST** use keyword-arguments when calling a function wrapped with `@sf.cache`. We can still call the function without the `cache_name`, in which case caching is disabled and the function is just called as normal. But again you **MUST** use named arguments such as `df=df_prices` instead of just `df_prices`:

In [17]:
%%time
df_result = my_function(df=df_prices)

CPU times: user 1.05 s, sys: 228 ms, total: 1.28 s
Wall time: 555 ms


In [18]:
df_result.head()

Ticker  Date      
A       2007-01-03    2620607.77
        2007-01-04    2119705.80
        2007-01-05    2722605.60
        2007-01-08    1603204.44
        2007-01-09    1432204.80
dtype: float64

If we pass the cache-arguments then the caching is automatically enabled:

In [19]:
# Cache arguments.
cache_name = market + '-all'
cache_refresh_days = 1

In [20]:
%%time
df_result2 = my_function(df=df_prices,
                         cache_name=cache_name,
                         cache_refresh_days=cache_refresh_days)

Cache-file 'my_function-us-all.pickle' not on disk.
- Running function my_function() ... Done!
- Saving cache-file to disk ... Done!
CPU times: user 1.1 s, sys: 336 ms, total: 1.44 s
Wall time: 705 ms


We may also create a dict with the cache-arguments, which is convenient if we want to use the same arguments in several functions:

In [21]:
cache_args = {'cache_name': cache_name,
              'cache_refresh_days' : cache_refresh_days}

In [22]:
%%time
df_result3 = my_function(df=df_prices, **cache_args)

Cache-file 'my_function-us-all.pickle' on disk (0 days old).
- Loading cache-file from disk ... Done!
CPU times: user 49.7 ms, sys: 43.9 ms, total: 93.6 ms
Wall time: 92 ms


Even for such a fairly quick function, the caching still saved a lot of time when using the raw pickle-format. But normally you would only use the caching-feature on functions that are very slow to compute.

We can check that the results are all identical:

In [23]:
df_result.equals(df_result2)

True

In [24]:
df_result.equals(df_result3)

True

## License (MIT)

This is published under the
[MIT License](https://github.com/simfin/simfin-tutorials/blob/master/LICENSE.txt)
which allows very broad use for both academic and commercial purposes.

You are very welcome to modify and use this source-code in your own project. Please keep a link to the [original repository](https://github.com/simfin/simfin-tutorials).
