This python notebook is based on the book "Python for Finance: Mastering Data-Driven Finance" by Yves Hilpisch.

Elaborated by: Francisco Arizola

# Part III: Financial Data Science

## Chapter 9: Input/Output Operations

Stored data volumes have been increasing at a much faster pace than the typical random access memory (RAM) available even in the largest machines. This makes it necessary not only to store data to disk for permanent storage, but also to compensate for lack of sufficient RAM by swapping data from RAM to disk and back.

input/output (I/O) operations are actions that involve reading from or writing to an external source. These sources can be files on a disk, standard input/output streams (such as the console), network connections, or other external devices. Input/output (I/O) operations are therefore important tasks when it comes to finance applications and data-intensive applications in general. Often they represent the bottleneck for performance-critical computations, since I/O operations cannot typically shuffle data fast enough to the RAM and from the RAM to the disk.

In [5]:
# Import necessary libraries 
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import pickle

# Style pyplot with serif font
plt.rcParams['font.family'] = 'serif'

# 3D plotting library
from mpl_toolkits.mplot3d import Axes3D

# Uncomment the next line if you do not have cufflinks and plotly installed
!pip install cufflinks plotly

import pandas as pd
import cufflinks as cf 
import plotly.offline as plyo

# Initialize plotly to work in offline mode
plyo.init_notebook_mode(connected=True)

# Optional: if you are using Jupyter Notebook, you might want to enable inline plotting
cf.go_offline()



### Basic I/O with Python

In terms of frequency, single financial analytics tasks generally process data of not more than a couple of gigabytes (GB) in size—and this is a sweet spot for Python and the libraries of its scientific stack, such as NumPy, pandas, and PyTables. Python has built-in functions to serialize and store any object on disk and to read it from disk into RAM; apart from that, Python is strong when it comes to working with text files and SQL databases. 

In [17]:
# First, let's start by writing objects to disk

# We import gauss to generate normally distributed random numbers
from random import gauss

# Create a large list with random numbers
a = [gauss(1.5, 2) for i in range(1000000)]

# Define the path to store the data files
path = r'C:\Users\Franc\data_Yves'

# Open a file for writing in binary mode
pkl_file = open(path + 'data.pkl', 'wb')

Serialization refers to the conversion of an object (hierarchy) to a byte stream; deserialization is the opposite operation. The two major functions to serialize and deserialize Python objects are pickle.dump(), for writing objects, and pickle.load(), for loading them into memory:

In [18]:
%time pickle.dump(a, pkl_file) # Serializes the object a and saves it to the file.

CPU times: total: 0 ns
Wall time: 13.5 ms


In [19]:
# Close the file
pkl_file.close()

# Show the file on disk and its size (Windows)
!dir {path}

# Open the file for reading in binary mode (rb)
pkl_file = open(path + 'data.pkl', 'rb')

 El volumen de la unidad C es OS
 El n£mero de serie del volumen es: 38BA-EF85

 Directorio de C:\Users\Franc\data_Yves

12/04/2024  18:23    <DIR>          .
16/07/2024  12:27    <DIR>          ..
               0 archivos              0 bytes
               2 dirs  151,822,450,688 bytes libres


In [20]:
%time b = pickle.load(pkl_file) # Reads the object from disk and deserializes it

CPU times: total: 31.2 ms
Wall time: 57.1 ms


In [21]:
# Converting a and b to ndarrary objects, np.allclose() verifies that both contain the same data (numbers)
np.allclose(np.array(a), np.array(b))

True