One easy way to store data in binary form (*serialization*) is Python's built-in `pickle` format, which every pandas object can write with its `to_pickle` method.

Read such a file back with `pandas.read_pickle`.
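As a quick check of the round trip described above, here is a minimal sketch. The frame contents and the file name `frame_pickle` are assumptions made up for illustration, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(100)})

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'frame_pickle')  # hypothetical file name
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# The round trip should preserve values, index, and dtypes
print(restored.equals(frame))
```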

## ! 
`pickle` is recommended only for short-term storage: the format is hard to keep stable over time, so an object pickled today may not unpickle with a later version of a library.

pandas has built-in support for the binary **HDF5** format. Older releases also supported **MessagePack**, but `to_msgpack` and `read_msgpack` were deprecated in pandas 0.25 and removed in 1.0. Some other formats are:
1. *bcolz* - A compressible column-oriented binary format based on the Blosc compression library.
2. *Feather* - A cross-language column-oriented file format designed by Wes McKinney together with Hadley Wickham of the R programming community. Feather uses the *Apache Arrow* columnar memory format.

### Using HDF5 Format

HDF5 is a well-known file format for storing large quantities of scientific array data. It is available as a C library. "HDF" stands for *hierarchical data format*.
1. An HDF5 file can store multiple datasets and supporting metadata.
2. HDF5 supports on-the-fly compression with a variety of compression modes, so data with repeated patterns can be stored more efficiently.
3. HDF5 is a good choice for large datasets that don't fit into memory, because you can efficiently read and write small sections of much larger arrays.

We can access HDF5 files directly with *PyTables* or *h5py*, but pandas provides a **high-level interface** that simplifies storing Series and DataFrame objects.

In [None]:
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')  # requires the PyTables package
# HDFStore works like a dict and handles the low-level details
store['obj1'] = frame
store['obj1_col'] = frame['a']
# Objects contained in the HDF5 file can then
# be retrieved with the same dict-like API
store['obj1']

HDFStore supports two storage schemas, `'fixed'` (the default) and `'table'`. The latter is generally slower but supports query operations using a special syntax:

In [None]:
# put is an explicit version of store['obj2'] = frame,
# but it lets us set other options such as the storage format
store.put('obj2', frame, format='table')
# Query rows by index label; this only works with the 'table' format
store.select('obj2', where=['index >= 10 and index <= 15'])
store.close()

In [None]:
# DataFrame.to_hdf and pandas.read_hdf are shortcuts to these tools
# (key is passed by keyword; recent pandas versions warn on positional use)
frame.to_hdf('mydata.h5', key='obj3', format='table')
pd.read_hdf('mydata.h5', key='obj3', where=['index < 5'])
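The on-the-fly compression and partial reads mentioned earlier can be sketched as follows. This is only an illustration: it assumes PyTables is installed, and the file name `compressed.h5`, the key `'obj'`, and the choices `complevel=9` and `complib='blosc'` are arbitrary values picked for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(100)})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'compressed.h5')  # hypothetical file name

    # complevel/complib enable compression for data written to this store
    with pd.HDFStore(path, mode='w', complevel=9, complib='blosc') as store:
        store.put('obj', frame, format='table')
        # start/stop select rows by position, so only that slice is read
        subset = store.select('obj', start=10, stop=15)

print(len(subset))
```

Using `HDFStore` as a context manager closes the file even if an error occurs, which avoids leaving a half-written store open.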

 - If you are processing data stored on remote servers (e.g. Amazon S3 or HDFS), a different binary format designed for distributed storage, such as [Apache Parquet](http://parquet.apache.org/), may be more suitable.
 - If you work with large datasets locally, using PyTables and h5py directly may suit your needs. Because many data analysis problems are I/O-bound (not CPU-bound), a tool like HDF5 can substantially accelerate such applications.
 - HDF5 is not a database. It is best suited for **write-once, read-many** datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.

**Extra tip!** An I/O-bound (input/output-bound) task is limited by the speed of reading from and writing to disk, network, etc., while a CPU-bound task is limited by the speed of the processor.

### Reading Microsoft Excel Files

Use `ExcelFile` or `pandas.read_excel` to read Excel files. These tools use the add-on packages `xlrd` and `openpyxl` to read XLS and XLSX files, respectively. (Remember to install them!)

In [None]:
xlsx = pd.ExcelFile('ex1.xlsx')
# Data stored in a sheet can then be read into a DataFrame,
# either with pandas.read_excel or with xlsx.parse('Sheet1')
pd.read_excel(xlsx, 'Sheet1')

In [None]:
# alternatively we can pass the filename to pandas.read_excel
frame = pd.read_excel('ex1.xlsx', 'Sheet1')

To write pandas data to Excel format, you must first create an `ExcelWriter`, then write data to it using pandas objects' `to_excel` method:

In [None]:
writer = pd.ExcelWriter('ex2.xlsx')
frame.to_excel(writer, sheet_name='Sheet1')
# close() writes the file to disk; writer.save() was removed in pandas 2.0
writer.close()

In [None]:
frame.to_excel('examples/ex2.xlsx') 
# pass a file path and avoid ExcelWriter
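When several sheets go into one workbook, `ExcelWriter` can also be used as a context manager. The sketch below is only an illustration: it assumes `openpyxl` is installed, and the small frame, the file name `two_sheets.xlsx`, and the sheet name `Summary` are made up for the example.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'two_sheets.xlsx')  # hypothetical file name

    # The with-block saves and closes the workbook on exit
    with pd.ExcelWriter(path) as writer:
        frame.to_excel(writer, sheet_name='Sheet1', index=False)
        frame.describe().to_excel(writer, sheet_name='Summary')

    # sheet_name=None returns a dict mapping sheet names to DataFrames
    sheets = pd.read_excel(path, sheet_name=None)

print(list(sheets))
```

Passing `index=False` keeps the row labels out of the first sheet, so reading it back yields the same two columns that were written.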