<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="45%" align="right" border="4">

# Input-Output Operations

Dr. Yves J. Hilpisch

The Python Quants GmbH

<a href='mailto:yves@tpq.io'>yves@tpq.io</a> | <a href='http://tpq.io'>http://tpq.io</a>

This part addresses the following areas:

* **basic I/O with `Python`**
* **I/O with `pandas`**
* **I/O with `PyTables`**

... and a bit of `SQLite3`.


## Basic I/O with Python

### Writing Objects to Disk

For later use, for documentation or for sharing with others, one might want to **store Python objects on disk** (e.g. via serialization).

In [None]:
!mkdir -p data
  # -p: no error if the folder already exists

In [None]:
# replace "yves" by your unique user name
# AND create a folder "data" in your home directory
# path = '/notebooks/training2/yves/data/'
path = 'data/'

In [None]:
import numpy as np
from random import gauss
import seaborn as sns; sns.set()

In [None]:
a = [gauss(1.5, 2) for i in range(1000000)]
  # generation of normally distributed randoms

The task now is to write this `list` object to disk for later retrieval. We use the `pickle` module.

In [None]:
import pickle

In [None]:
pkl_file = open(path + 'data.pkl', 'wb')
  # open file for writing (binary mode)
  # Note: existing file might be overwritten

The two major functions are `dump` and `load`.

In [None]:
%time pickle.dump(a, pkl_file)

In [None]:
pkl_file

In [None]:
pkl_file.close()

We can now inspect the **size of the file** on disk.

In [None]:
ll $path

Now that we have data on disk, we can **read it again to the memory**.

In [None]:
pkl_file = open(path + 'data.pkl', 'rb')  # open file for reading (binary mode)

In [None]:
%time b = pickle.load(pkl_file)

In [None]:
b[:5]

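To check that the **objects `a` and `b` hold the same data**, note that `a is b` only tests identity (here `False`, since `b` is a new object). For a comparison of the values, `NumPy` provides the function `allclose`.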

In [None]:
a is b

In [None]:
np.allclose(np.array(a), np.array(b))
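
The same round trip can be written more compactly with context managers, which close the file even if an error occurs. The following is a minimal sketch, not part of the original session: the temporary folder, the file name `demo.pkl` and the smaller sample size are assumptions for illustration. Binary file modes are used since Python 3 requires them for `pickle`.

```python
import os
import pickle
import tempfile
from random import gauss

# smaller sample than in the text, to keep the sketch fast
a = [gauss(1.5, 2) for _ in range(1000)]

# hypothetical scratch location instead of the `path` used in the text
tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, 'demo.pkl')

with open(fname, 'wb') as f:  # binary mode for writing
    pickle.dump(a, f)

with open(fname, 'rb') as f:  # binary mode for reading
    b = pickle.load(f)

same_object = a is b  # identity: two distinct list objects
same_values = a == b  # equality: element-wise identical values

# clean up the scratch file and folder
os.remove(fname)
os.rmdir(tmp_dir)
```

Here `same_object` is `False` while `same_values` is `True`, which is the distinction between identity and equality made above.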

What about **two objects**?

In [None]:
pkl_file = open(path + 'data.pkl', 'wb')  # open file for writing (binary mode)

In [None]:
%time pickle.dump(np.array(a), pkl_file)

In [None]:
%time pickle.dump(np.array(a) ** 2, pkl_file)

In [None]:
pkl_file.close()

In [None]:
ll $path

Let us read the two `ndarray` objects **back into memory**.

In [None]:
pkl_file = open(path + 'data.pkl', 'rb')  # open file for reading (binary mode)

In [None]:
x = pickle.load(pkl_file)
x

In [None]:
y = pickle.load(pkl_file)
y

In [None]:
pkl_file.close()
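
If the number of objects in a file is not known in advance, one can call `pickle.load` repeatedly until the end of the file is reached. A minimal sketch, assuming an in-memory `io.BytesIO` buffer as a stand-in for a file opened in binary mode (the sample objects are made up for illustration):

```python
import io
import pickle

buf = io.BytesIO()  # in-memory stand-in for a file opened with 'wb'
for obj in ([1, 2, 3], {'k': 'v'}, 'third'):
    pickle.dump(obj, buf)  # objects are appended one after the other

buf.seek(0)  # rewind, as if the file were reopened for reading
loaded = []
while True:
    try:
        loaded.append(pickle.load(buf))  # returns the next object in order
    except EOFError:
        break  # no more objects in the stream
```

The objects come back in exactly the order in which they were written.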


Obviously, `pickle` stores objects according to the **first in, first out** (FIFO) principle. Sometimes it is more convenient to store multiple objects in a single `dict` object, so that they can be retrieved by name:

In [None]:
pkl_file = open(path + 'data.pkl', 'wb')  # open file for writing (binary mode)
pickle.dump({'x' : x, 'y' : y}, pkl_file)
pkl_file.close()

In [None]:
pkl_file = open(path + 'data.pkl', 'rb')  # open file for reading (binary mode)
data = pickle.load(pkl_file)
pkl_file.close()
for key in data.keys():
    print('%s %s' % (key, data[key][:4]))

In [None]:
!rm -f $path/data.pkl

### Reading and Writing Text Files

**Text processing** can be considered a strength of `Python`. Consider that we have generated quite a large set of data that we want to save and share as a **comma separated value (CSV) file**.

In [None]:
rows = 500000
a = np.random.standard_normal((rows, 5))  # dummy data

In [None]:
a.round(4)

We add **date-time information** to the mix.

In [None]:
import pandas as pd
t = pd.date_range(start='2014/1/1', periods=rows, freq='H')
    # set of hourly datetime objects

In [None]:
t

Let us write the data as a **`CSV` file**.

In [None]:
csv_file = open(path + 'data.csv', 'w')  # open file for writing

In [None]:
header = 'date,no1,no2,no3,no4,no5\n'
csv_file.write(header)

The actual data is then **written row by row**, merging the date-time information with the (pseudo-)random numbers.

In [None]:
%%time
for t_, (no1, no2, no3, no4, no5) in zip(t, a):
    s = '%s,%f,%f,%f,%f,%f\n' % (t_, no1, no2, no3, no4, no5)
    csv_file.write(s)
csv_file.close()

In [None]:
ll $path

Now let us **read the data** from the just written file.

In [None]:
csv_file = open(path + 'data.csv', 'r')  # open file for reading

In [None]:
%%time
for i in range(5):
    print(csv_file.readline().rstrip())

You can also read the whole content **at once**.

In [None]:
%%time
csv_file = open(path + 'data.csv', 'r')
content = csv_file.readlines()
for line in content[:5]:
    print(line.rstrip())

Some **closing operations** to conclude the example.

In [None]:
csv_file.close()
!rm -f $path/*
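
As an alternative to formatting each row by hand, the standard library module `csv` takes care of separators and line endings. The following is a minimal sketch under stated assumptions: an in-memory `io.StringIO` buffer replaces the file on disk, and the three sample rows are made up for illustration.

```python
import csv
import datetime as dt
import io

# three made-up rows: a timestamp plus two numbers
rows = [(dt.datetime(2014, 1, 1, h), 0.1 * h, -0.2 * h) for h in range(1, 4)]

buf = io.StringIO(newline='')  # in-memory stand-in for the CSV file
writer = csv.writer(buf)
writer.writerow(['date', 'no1', 'no2'])  # header row
for t_, no1, no2 in rows:
    # same number format as in the row-by-row loop in the text
    writer.writerow([t_, '%f' % no1, '%f' % no2])

buf.seek(0)  # rewind for reading
reader = csv.reader(buf)
header = next(reader)   # first line holds the column names
records = list(reader)  # remaining lines as lists of strings
```

Note that `csv.reader` returns every field as a `str` object; converting the fields back to numbers or date-time objects is left to the caller.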

### SQL Databases

Python can work with any kind of **`SQL` database** and in general also with any kind of **`NoSQL` database**. One database that is delivered with Python by default is `SQLite3` (cf. http://www.sqlite.org).

In [None]:
import sqlite3 as sq3

**Queries** are formulated as `string` objects. Here: creation of a table.

In [None]:
query = 'CREATE TABLE numbs (Date date, No1 real, No2 real)'

Then open a **database connection**.

In [None]:
con = sq3.connect(path + 'numbs.db')

Then **execute the query** statement and commit.

In [None]:
con.execute(query)

In [None]:
con.commit()

The next step is to **populate** the table with data.

In [None]:
import datetime as dt

In [None]:
# write single row
con.execute('INSERT INTO numbs VALUES(?, ?, ?)',
            (dt.datetime.now(), 0.12, 7.3))

Usually, one wants to write a larger data set **in bulk**.

In [None]:
data = np.random.standard_normal((10000, 2)).round(5)

In [None]:
for row in data:
    con.execute('INSERT INTO numbs VALUES(?, ?, ?)',
                (dt.datetime.now(), row[0], row[1]))
con.commit()
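
The loop above issues one `execute` call per row. A minimal sketch of the same bulk write with a single `executemany` call follows. It assumes an in-memory database (`':memory:'`) instead of the file used in the text, and it stores the timestamp as an ISO 8601 string so that the sketch does not rely on the default `datetime` adapter of `sqlite3`.

```python
import datetime as dt
import sqlite3 as sq3

import numpy as np

con = sq3.connect(':memory:')  # in-memory database, assumed for the sketch
con.execute('CREATE TABLE numbs (Date date, No1 real, No2 real)')

data = np.random.standard_normal((1000, 2)).round(5)
now = dt.datetime.now().isoformat()  # timestamp as ISO 8601 text

# tolist() converts NumPy floats to plain Python floats
rows = [(now, no1, no2) for no1, no2 in data.tolist()]

con.executemany('INSERT INTO numbs VALUES (?, ?, ?)', rows)
con.commit()

n_rows = con.execute('SELECT COUNT(*) FROM numbs').fetchone()[0]
con.close()
```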

**Retrieving multiple rows** is easy.

In [None]:
con.execute('SELECT * FROM numbs').fetchmany(10)

Or you can just read a **single data row at a time**.

In [None]:
pointer = con.execute('SELECT * FROM numbs')

In [None]:
for i in range(3):
    print(pointer.fetchone())

In [None]:
con.close()
!rm -f $path/numb*
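
The cursor object returned by `execute` can also be iterated over directly, which fetches the rows one at a time. A minimal sketch, assuming an in-memory database and a simplified two-column table with made-up values:

```python
import sqlite3 as sq3

con = sq3.connect(':memory:')  # in-memory database, assumed for the sketch
con.execute('CREATE TABLE numbs (No1 real, No2 real)')
con.executemany('INSERT INTO numbs VALUES (?, ?)',
                [(0.5 * i, -1.0 * i) for i in range(10)])
con.commit()

# the placeholder ? keeps the threshold value out of the query string
cursor = con.execute('SELECT * FROM numbs WHERE No1 > ?', (2.0,))
selected = [row for row in cursor]  # each row is a tuple
con.close()
```

With the made-up values above, five rows satisfy the condition, the first one being `(2.5, -5.0)`.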

### Writing and Reading Numpy Arrays

`NumPy` has its own I/O capabilities.

In [None]:
import numpy as np

We replicate the `SQLite3` example with a `NumPy` structured array.

In [None]:
dtimes = np.arange('2015-01-01 10:00:00', '2021-12-31 22:00:00',
                  dtype='datetime64[m]')  # minute intervals
len(dtimes)

In [None]:
dty = np.dtype([('Date', 'datetime64[m]'), ('No1', 'f'), ('No2', 'f')])
data = np.zeros(len(dtimes), dtype=dty)

Use the data to **populate** the different columns.

In [None]:
data['Date'] = dtimes

In [None]:
a = np.random.standard_normal((len(dtimes), 2)).round(5)
data['No1'] = a[:, 0]
data['No2'] = a[:, 1]

In [None]:
data[:4]

Writing and reading `ndarray` objects is **highly optimized** (hardware bound in general).

In [None]:
%time np.save(path + 'array', data)  # suffix .npy is added

In [None]:
ll $path/ar*

Reading is **even faster**.

In [None]:
%time np.load(path + 'array.npy')
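
A minimal sketch that checks the round trip for a small structured array: both the values and the `dtype` (including the `datetime64` column) should survive `np.save` and `np.load`. The temporary folder and the reduced time range are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

dty = np.dtype([('Date', 'datetime64[m]'), ('No1', 'f'), ('No2', 'f')])
dtimes = np.arange('2015-01-01 10:00', '2015-01-01 12:00',
                   dtype='datetime64[m]')  # 120 minute intervals
data = np.zeros(len(dtimes), dtype=dty)
data['Date'] = dtimes
data['No1'] = np.random.standard_normal(len(dtimes))
data['No2'] = np.random.standard_normal(len(dtimes))

# hypothetical scratch location instead of the `path` used in the text
tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, 'array.npy')

np.save(fname, data)
restored = np.load(fname)

same_dtype = restored.dtype == data.dtype
same_values = bool((restored == data).all())

os.remove(fname)
os.rmdir(tmp_dir)
```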

Let us try a **larger data set**.

In [None]:
%time data = np.random.standard_normal((10000, 6000))

In [None]:
%time np.save(path + 'array', data) 

In [None]:
ll $path/ar*

And also **reading it**.

In [None]:
%time np.load(path + 'array.npy')

In [None]:
data = 0.0
!rm -f $path/array.npy

## I/O with pandas

One of the major strengths of the `pandas` library is that it can read and write different data formats natively, among others:

* `CSV` (comma separated value)
* `SQL` (structured query language)
* `XLS/XLSX` (Microsoft Excel files)
* `JSON` (JavaScript object notation)
* `HTML` (hypertext markup language)

Our test case is again a **large set of floating point numbers** (1mn rows).

In [None]:
import numpy as np
import pandas as pd
data = np.random.standard_normal((1000000, 5)).round(5)
        # sample data set

In [None]:
filename = path + 'numbs'

### SQL Database

The **benchmark case** with `SQLite3`.

In [None]:
import sqlite3 as sq3

In [None]:
query = 'CREATE TABLE numbers (No1 real, No2 real,\
        No3 real, No4 real, No5 real)'

In [None]:
con = sq3.Connection(filename + '.db')

In [None]:
con.execute(query)

**Writing the data** in bulk.

In [None]:
%%time
con.executemany('INSERT INTO numbers VALUES (?, ?, ?, ?, ?)', data)
con.commit()

In [None]:
ll $path

**Reading is faster** than writing.

In [None]:
%%time
temp = con.execute('SELECT * FROM numbers').fetchall()
print(temp[:2])
temp = 0.0

Reading a `SQL` query result into an `ndarray` object ...

In [None]:
%%time
query = 'SELECT * FROM numbers WHERE No1 > 0 AND No2 < 0'
res = np.array(con.execute(query).fetchall()).round(3)

... and plotting it.

In [None]:
res = res[::100]  # every 100th result
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(res[:, 0], res[:, 1], 'ro')
plt.grid(True); plt.xlim(-0.5, 4.5); plt.ylim(-4.5, 0.5)

### From SQL to pandas

`pandas` can be used to make such an operation more **convenient and efficient**.

In [None]:
# import pandas.io.sql as pds

The code for reading the data becomes a bit **more compact**.

In [None]:
%time data = pd.read_sql('SELECT * FROM numbers', con)

In [None]:
data.head()
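
`pandas` can also write a `DataFrame` object to a `SQL` table, so that the whole round trip needs only two calls. A minimal sketch, assuming an in-memory `SQLite3` database and a small made-up data set (the column names follow the table used above):

```python
import sqlite3 as sq3

import numpy as np
import pandas as pd

con = sq3.connect(':memory:')  # in-memory database, assumed for the sketch
frame = pd.DataFrame(np.random.standard_normal((100, 5)).round(5),
                     columns=['No1', 'No2', 'No3', 'No4', 'No5'])

# write the DataFrame to a new table, without the index column
frame.to_sql('numbers', con, index=False)

# read back only a sub-set of columns and rows
subset = pd.read_sql('SELECT No1, No2 FROM numbers WHERE No1 > 0', con)
con.close()
```

The selection is done by the database here, so only the matching rows are transferred to `pandas`.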

The data is now **in-memory**. This allows for much **faster analytics**.

In [None]:
%time data[(data['No1'] > 0) & (data['No2'] < 0)].head()

A more **complex query**.

In [None]:
%%time
res = data[['No1', 'No2']][((data['No1'] > 0.5) | (data['No1'] < -0.5))
                     & ((data['No2'] < -1) | (data['No2'] > 1))]

In [None]:
plt.plot(res.No1, res.No2, 'ro')
plt.grid(True); plt.axis('tight')
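
Conditions of this kind can also be expressed as a string via the `query` method of the `DataFrame` object, which some find easier to read. A minimal sketch on a small made-up data set, checking that both ways select the same rows:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.standard_normal((1000, 2)).round(5),
                    columns=['No1', 'No2'])

# selection with a boolean mask, as in the text
mask = (((data['No1'] > 0.5) | (data['No1'] < -0.5))
        & ((data['No2'] < -1) | (data['No2'] > 1)))
res_mask = data[mask]

# same selection, formulated as a query string
res_query = data.query('((No1 > 0.5) | (No1 < -0.5))'
                       ' & ((No2 < -1) | (No2 > 1))')

identical = res_mask.equals(res_query)
```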

**Writing the data** to disk with `pandas`.

In [None]:
h5s = pd.HDFStore(filename + '.h5s', 'w')

In [None]:
%time h5s['data'] = data

In [None]:
h5s

In [None]:
h5s.close()

Again, **reading is even faster**.

In [None]:
%%time
h5s = pd.HDFStore(filename + '.h5s', 'r')
temp = h5s['data']
h5s.close()

A brief check whether the data sets are indeed the same.

In [None]:
temp is data

In [None]:
np.allclose(np.array(temp), np.array(data))

In [None]:
temp = 0.0

A look at the two files now on disk shows that the `HDF5` format consumes somewhat less disk space.

In [None]:
ll $path

### Data as CSV File

`pandas` is pretty good at processing `CSV` files.

In [None]:
%time data.to_csv(filename + '.csv')

In [None]:
%%time
pd.read_csv(filename + '.csv')[['No1', 'No2',
                                'No3', 'No4']].hist(bins=20)
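
One detail worth knowing: `to_csv` writes the index as a first, unnamed column, which `read_csv` returns as a regular column unless told otherwise. A minimal sketch of the round trip, assuming an in-memory `io.StringIO` buffer instead of a file and a small made-up data set:

```python
import io

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.standard_normal((100, 5)).round(5),
                    columns=['No1', 'No2', 'No3', 'No4', 'No5'])

buf = io.StringIO()  # in-memory stand-in for the CSV file
data.to_csv(buf)     # the index is written as the first column

buf.seek(0)
naive = pd.read_csv(buf)  # index comes back as column 'Unnamed: 0'

buf.seek(0)
restored = pd.read_csv(buf, index_col=0)  # first column used as index
```

With `index_col=0` the restored object has the same shape and columns as the original one.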

### Data as Excel File

The same holds true for **Excel spreadsheet files** &ndash; however, performance is not too good with this format.

In [None]:
%time data[:10000].to_excel(filename + '.xlsx')

In [None]:
%time pd.read_excel(filename + '.xlsx', 'Sheet1').cumsum().plot()

Comparing **file sizes**.

In [None]:
ll $path/numb*

In [None]:
rm -f $path/*

#### EXERCISE: Retrieving and storing stock price data

Using `pandas`, retrieve and save in HDF5 format the stock price information for the following stocks (starting 1.1.2005):

* Yahoo
* Microsoft
* Apple

Write all the stock price data also to an Excel spreadsheet.

## Fast I/O with PyTables

`PyTables` is a Python binding for the **`HDF5` database/file standard** (cf. http://www.hdfgroup.org). It is specifically designed to optimize the performance of I/O operations and to make the best use of the available hardware.

In [None]:
import numpy as np
import tables as tb
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

### Working with Tables

`PyTables` provides a **file-based database format**.

In [None]:
filename = path + 'tab.h5'
h5 = tb.open_file(filename, 'w') 

In [None]:
rows = 2000000

The table itself has a date-time column (stored as a string of up to 26 characters), two `int` columns and two `float` columns.

In [None]:
row_des = {
    'Date': tb.StringCol(26, pos=1),
    'No1': tb.IntCol(pos=2),
    'No2': tb.IntCol(pos=3),
    'No3': tb.Float64Col(pos=4),
    'No4': tb.Float64Col(pos=5)
    }

When creating the table, we choose **no compression** for the moment.

In [None]:
filters = tb.Filters(complevel=0)  # no compression
tab = h5.create_table('/', 'ints_floats', row_des,
                      title='Integers and Floats',
                      expectedrows=rows, filters=filters)

In [None]:
tab

Now generate the **sample data**.

In [None]:
ran_int = np.random.randint(0, 10000, size=(rows, 2))
ran_flo = np.random.standard_normal((rows, 2)).round(5)

The sample data set is **written row-by-row** to the table.

In [None]:
pointer = tab.row

In [None]:
%%time
for i in range(rows):
    pointer['Date'] = dt.datetime.now()
    pointer['No1'] = ran_int[i, 0]
    pointer['No2'] = ran_int[i, 1] 
    pointer['No3'] = ran_flo[i, 0]
    pointer['No4'] = ran_flo[i, 1] 
    pointer.append()
      # this appends the data and
      # moves the pointer one row forward
tab.flush()

The **table object** after writing the data.

In [None]:
tab

In [None]:
ll $path

There is a **more performant and Pythonic way** to accomplish the same result: by the use of `NumPy` structured arrays.

In [None]:
dty = np.dtype([('Date', 'S26'), ('No1', '<i4'), ('No2', '<i4'),
                                 ('No3', '<f8'), ('No4', '<f8')])
sarray = np.zeros(len(ran_int), dtype=dty)

In [None]:
sarray

In [None]:
%%time
sarray['Date'] = dt.datetime.now()
sarray['No1'] = ran_int[:, 0]
sarray['No2'] = ran_int[:, 1]
sarray['No3'] = ran_flo[:, 0]
sarray['No4'] = ran_flo[:, 1]

Instead of the row description, just **provide this structured array** to create the table.

In [None]:
%%time
h5.create_table('/', 'ints_floats_from_array', sarray,
                      title='Integers and Floats',
                      expectedrows=rows, filters=filters)

Generating the same result, this approach is obviously **much faster**.

In [None]:
h5

We **delete the duplicate table** since it is no longer needed.

In [None]:
h5.remove_node('/', 'ints_floats_from_array')

The `Table` object behaves like typical Python and `NumPy` objects when it comes to slicing, for example.

In [None]:
tab[:3]

In [None]:
tab[:4]['No4']

Even more convenient and important: you can **apply universal functions** to tables or sub-sets of the table.

In [None]:
%time np.sum(tab[:]['No3'])

In [None]:
%time np.sum(np.sqrt(tab[:]['No1']))

The `Table` object also behaves very similarly when it comes to **plotting**.

In [None]:
%%time
plt.hist(tab[:]['No3'], bins=30)
plt.grid(True)
print(len(tab[:]['No3']))

`PyTables` is also able to perform **(out-of-memory) analytics** of the types seen before.

In [None]:
%%time
res = np.array([(row['No3'], row['No4']) for row in
        tab.where('((No3 < -0.5) | (No3 > 0.5)) \
                 & ((No4 < -1) | (No4 > 1))')])[::100]

In [None]:
plt.plot(res.T[0], res.T[1], 'ro')
plt.grid(True)
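
Instead of looping over the iterator returned by `where`, the method `read_where` of the `Table` object returns all matching rows at once as a structured array. A minimal sketch, assuming a small two-column table in a temporary file (the file and node names are made up) and comparing the result with the same selection done in memory with `NumPy`:

```python
import os
import tempfile

import numpy as np
import tables as tb

rows = 10000
dty = np.dtype([('No3', '<f8'), ('No4', '<f8')])
sarray = np.zeros(rows, dtype=dty)
sarray['No3'] = np.random.standard_normal(rows)
sarray['No4'] = np.random.standard_normal(rows)

# hypothetical scratch location instead of the `path` used in the text
tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, 'demo.h5')
cond = '((No3 < -0.5) | (No3 > 0.5)) & ((No4 < -1) | (No4 > 1))'

with tb.open_file(fname, 'w') as h5:
    tab = h5.create_table('/', 'floats', sarray, title='Floats')
    res = tab.read_where(cond)  # structured array of matching rows

# the same selection in memory, for comparison
mask = (((sarray['No3'] < -0.5) | (sarray['No3'] > 0.5))
        & ((sarray['No4'] < -1) | (sarray['No4'] > 1)))

os.remove(fname)
os.rmdir(tmp_dir)
```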

As the following example shows, working with data stored in `PyTables` as a `Table` object makes you feel like you are working in-memory.

In [None]:
%%time
values = tab.cols.No3[:]
print("Max %18.3f" % values.max())
print("Ave %18.3f" % values.mean())
print("Min %18.3f" % values.min())
print("Std %18.3f" % values.std())

### Working with Compressed Tables

`PyTables` makes **compression of data** straightforward.

In [None]:
filename = path + 'tab.h5c'
h5c = tb.open_file(filename, 'w') 

In [None]:
filters = tb.Filters(complevel=4, complib='blosc')

In [None]:
%%time
tabc = h5c.create_table('/', 'ints_floats', sarray,
                        title='Integers and Floats',
                      expectedrows=rows, filters=filters)

For example, **analytics remains the same** &ndash; no matter if data is compressed or not.

In [None]:
%%time
res = np.array([(row['No3'], row['No4']) for row in
             tabc.where('((No3 < -0.5) | (No3 > 0.5)) \
                       & ((No4 < -1) | (No4 > 1))')])[::100]

Compression can both **increase and decrease performance** of I/O operations.

In [None]:
%time arr_non = tab.read()

In [None]:
%time arr_com = tabc.read()

In this case, working with compression takes longer. However, we realize a **compression ratio of 20%**.

In [None]:
ll $path

In [None]:
h5c.close()
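
The effect of compression on file size can also be measured directly in Python. The following is a minimal sketch under stated assumptions: it uses the `zlib` library rather than `blosc`, a small table with two integer columns only, and temporary files whose names are made up. The numbers it yields are not comparable to those of the larger example above.

```python
import os
import tempfile

import numpy as np
import tables as tb

rows = 200000
dty = np.dtype([('No1', '<i4'), ('No2', '<i4')])
sarray = np.zeros(rows, dtype=dty)
sarray['No1'] = np.random.randint(0, 10000, rows)
sarray['No2'] = np.random.randint(0, 10000, rows)

tmp_dir = tempfile.mkdtemp()  # hypothetical scratch location
sizes = {}
for label, complevel in (('plain', 0), ('zlib', 4)):
    fname = os.path.join(tmp_dir, label + '.h5')
    filters = tb.Filters(complevel=complevel, complib='zlib')
    with tb.open_file(fname, 'w') as h5:
        h5.create_table('/', 'ints', sarray,
                        expectedrows=rows, filters=filters)
    sizes[label] = os.path.getsize(fname)  # file size in bytes
    os.remove(fname)
os.rmdir(tmp_dir)

ratio = sizes['zlib'] / float(sizes['plain'])  # below 1 means smaller file
```

How much is saved depends strongly on the data: small integers such as these compress well, while random floating point numbers compress much less.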

### Working with Arrays

`PyTables` supports the **performant storage and retrieval of `NumPy` arrays**.

In [None]:
%%time
arr_int = h5.create_array('/', 'integers', ran_int)
arr_flo = h5.create_array('/', 'floats', ran_flo)

The result.

In [None]:
h5

In [None]:
h5.close()

In [None]:
!rm -f $path/*

### Out-of-Memory Computations

`PyTables` supports **out-of-memory (numerical) operations**, which makes it possible to implement array-based computations on data that does not fit into memory.

In [None]:
import numpy as np
import tables as tb

In [None]:
filename = path + 'array.h5'
h5 = tb.open_file(filename, 'w') 

We create an `EArray` object that is **extendable in the first dimension** and has a fixed width of 100 in the second dimension.

In [None]:
n = 100
ear = h5.create_earray(h5.root, 'ear',
                      atom=tb.Float64Atom(),
                      shape=(0, n))

Since it is extendable, such an object can be **populated chunk-wise**.

In [None]:
%%time
rand = np.random.standard_normal((n, n))
for _ in range(7500):
    ear.append(rand)
ear.flush()

The `EArray` object is **600 MB large**.

In [None]:
ear

In [None]:
ear.size_on_disk

For an out-of-memory computation, we need a **target `EArray` object** in the database.

In [None]:
out = h5.create_earray(h5.root, 'out',
                      atom=tb.Float64Atom(),
                      shape=(0, n))

The **numerical expression** to be evaluated.

In [None]:
# the numerical expression as a string object
expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))')
# target to store results is disk-based array
expr.set_output(out, append_mode=True)

And **the evaluation**.

In [None]:
%time expr.eval()
  # evaluation of the numerical expression
  # and storage of results in disk-based array

In [None]:
out[0, :10]
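
To gain confidence in the disk-based result, one can compare it with a plain `NumPy` computation on a data set small enough to fit into memory. A minimal sketch, assuming a temporary file, a much smaller array than above and the variables passed explicitly via the `uservars` argument of `tb.Expr`:

```python
import os
import tempfile

import numpy as np
import tables as tb

n = 10
tmp_dir = tempfile.mkdtemp()  # hypothetical scratch location
fname = os.path.join(tmp_dir, 'array.h5')

with tb.open_file(fname, 'w') as h5:
    ear = h5.create_earray(h5.root, 'ear',
                           atom=tb.Float64Atom(), shape=(0, n))
    rand = np.random.standard_normal((n, n))
    for _ in range(50):
        ear.append(rand)  # chunk-wise population, 500 rows in total

    out = h5.create_earray(h5.root, 'out',
                           atom=tb.Float64Atom(), shape=(0, n))
    # variables are handed over explicitly instead of being
    # looked up in the calling namespace
    expr = tb.Expr('3 * sin(ear) + sqrt(abs(ear))', uservars={'ear': ear})
    expr.set_output(out, append_mode=True)
    expr.eval()  # results are written to the disk-based array

    on_disk = out.read()
    values = ear.read()
    in_memory = 3 * np.sin(values) + np.sqrt(np.abs(values))

os.remove(fname)
os.rmdir(tmp_dir)
```

Both results should agree up to floating point precision, which can be checked with `np.allclose(on_disk, in_memory)`.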

The same **in-memory** ...

In [None]:
%time imarray = ear.read()
  # read whole array into memory

In [None]:
import numexpr as ne
expr = '3 * sin(imarray) + sqrt(abs(imarray))'

... takes a similar amount of time (after having read the data into memory).

In [None]:
ne.set_num_threads(1)
%time ne.evaluate(expr)[0, :10]

In [None]:
ne.set_num_threads(4)
%time ne.evaluate(expr)[0, :10]

In [None]:
h5.close()

In [None]:
!rm -f $path/*

## Note of Caution

All these (performance/speed) numbers heavily depend on the **infrastructure** used (CPU [cores, clock speed], cache sizes, bus speed, RAM, HDD/SSD, etc.). 
