# I ❤ parquet

Parquet is my favorite file format for large amounts of columnar data. Here are a few patterns to help you get started! The [pyarrow Parquet docs](https://arrow.apache.org/docs/python/parquet.html) are also helpful.

In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

When reading parquet files I like to use `pq.read_table`. You can also use `pd.read_parquet` as shown in the [competition introduction notebook](https://www.kaggle.com/jiashenliu/introduction-to-financial-concepts-and-data).

In [None]:
table = pq.read_table('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/c439ef22282f412ba39e9137a3fdabac.parquet')
df = table.to_pandas()
df.head()


The competition has been structured with *partitioned* datasets. That means the value for `stock_id` is stored in the path and not in the file. It's called the *partition key*. The `pyarrow.parquet` library has a `ParquetDataset` class that takes advantage of partitions.

In [None]:
%%time
dataset = pq.ParquetDataset('../input/optiver-realized-volatility-prediction/trade_train.parquet/')  
table = dataset.read()
trades = table.to_pandas()
trades.info()

We just loaded ALL the trades for every stock!  The *partition key* became the categorical column `stock_id` automatically! You can do this for the book data too.

In [None]:
dataset = pq.ParquetDataset('../input/optiver-realized-volatility-prediction/book_train.parquet/') 
books = dataset.read()
books = books.to_pandas()  # I overwrite the pyarrow table object here to save memory
books.info()

In [None]:
print(f'Found {books.time_id.nunique()} unique time ids')
print(f'Found {books.stock_id.nunique()} unique stock ids.')

My kernel shows 7 out of 16 GB used, which might cause problems when you start unleashing your pandas magic. One way to save memory, which you already know, is to load one stock at a time. But you might not know there's a way to get a single stock using the `filters` option!

In [None]:
dataset = pq.ParquetDataset('../input/optiver-realized-volatility-prediction/book_train.parquet/',
                            filters=[('stock_id', '=', '5')])
table = dataset.read()
stock = table.to_pandas()
print(f'Found {stock.time_id.nunique()} unique time ids')
print(f'Found {stock.stock_id.nunique()} unique stock ids.')

You could have also iterated through the files, but this way the `stock_id` column is already in the dataframe. The `filters` option can also select rows by the value of any column. Here we get every stock for `time_id=5`.

In [None]:
dataset = pq.ParquetDataset('../input/optiver-realized-volatility-prediction/book_train.parquet/',
                            use_legacy_dataset=False,
                            # time_id is an integer column inside the files, so compare against an integer
                            filters=[('time_id', '=', 5)])
table = dataset.read()
stock = table.to_pandas()
print(f'Found {stock.time_id.nunique()} unique time ids')
print(f'Found {stock.stock_id.nunique()} unique stock ids.')

Note that I used the option `use_legacy_dataset=False`.  This is required to filter on something other than *partition keys*. 