[Apache Parquet](https://arrow.apache.org/docs/python/parquet.html) is an efficient columnar storage format. Compared to saving this dataset as CSV files, using parquet:
- Greatly reduces the necessary disk space
- Loads the data into Pandas with memory efficient datatypes
- Enables fast reads from disk
- Allows us to easily work with partitions of the data

Pandas has a parquet integration that makes loading data into a dataframe trivial; we'll try that now.

In [1]:
import pandas as pd

In [2]:
import os
# Raw string so the backslashes in the Windows path are not treated as escape sequences
os.chdir(r'C:\coursera\Optiver-Realized-Volatility-Prediction\data')

In [3]:
book_train = pd.read_parquet('book_train.parquet/stock_id=0')
book_train

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
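
The `ImportError` above indicates that neither parquet engine could be imported in this environment. As a minimal sketch (the helper name `available_parquet_engines` is hypothetical and not part of pandas), the standard library can report which of the two engines named in the message is importable before `pd.read_parquet` is called:

```python
import importlib.util


def available_parquet_engines():
    """Return which of the two engines named in the error message can be found.

    This only checks that the package is installed; it does not check that
    the installed version is recent enough for pandas.
    """
    candidates = ("pyarrow", "fastparquet")
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet engines found:", ", ".join(engines))
else:
    # Same remedy the error message suggests.
    print("No parquet engine found; install one with pip or conda, e.g. `pip install pyarrow`.")
```

If the list comes back empty, installing either package and restarting the kernel should let the cell above run.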

In [4]:
import glob
subset_paths = glob.glob('book_train.parquet/stock_id=*')
for path in subset_paths:
    print(type(path))
    book_train = pd.read_parquet(path)
    book_train.time_id.hist()
    break

<class 'str'>


ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.

In [5]:
import pymc3 as pm
import scipy.stats

Y = scipy.stats.bernoulli(0.7).rvs(20)

# Declare a model in PyMC3
with pm.Model() as model:
    # Specify the prior distribution of unknown parameter
    θ = pm.Beta("θ", alpha=1, beta=1)

    # Specify the likelihood distribution and condition on the observed data
    y_obs = pm.Binomial("y_obs", n=1, p=θ, observed=Y)

    # Sample from the posterior distribution
    idata = pm.sample(1000, return_inferencedata=True)

WARN: Could not locate executable g77
WARN: Could not locate executable f77
WARN: Could not locate executable ifort
WARN: Could not locate executable ifl
WARN: Could not locate executable f90
WARN: Could not locate executable DF
WARN: Could not locate executable efl




ImportError: DLL load failed while importing mf6917bb35eaa79d4a20c20b9dc13c0435e656b1bdb67265fed3d06258ff43ef9: The specified module could not be found.

In [None]:
import scipy.stats

scipy.stats.uniform(0, 1).rvs(1)
Y = scipy.stats.bernoulli(0.7).rvs(20)

scipy.stats.beta(2, 5).pdf(0.7)*scipy.stats.bernoulli(0.7).pmf(Y).prod()

6.620244760247978e-08

In [30]:
trade_train = pd.read_parquet('trade_train.parquet/stock_id=0')
trade_train

# import glob
# subset_paths = glob.glob('trade_train.parquet/stock_id=*')
# for path in subset_paths:
#     print(type(path))
#     trade_train = pd.read_parquet(path)
#     trade_train.time_id.plot()

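| | time_id | seconds_in_bucket | price | size | order_count |
|---|---|---|---|---|---|
| 0 | 5 | 21 | 1.002301 | 326 | 12 |
| 1 | 5 | 46 | 1.002778 | 128 | 4 |
| 2 | 5 | 50 | 1.002818 | 55 | 1 |
| 3 | 5 | 57 | 1.003155 | 121 | 5 |
| 4 | 5 | 68 | 1.003646 | 4 | 1 |
| ... | ... | ... | ... | ... | ... |
| 123438 | 32767 | 471 | 0.998659 | 200 | 3 |
| 123439 | 32767 | 517 | 0.998515 | 90 | 1 |
| 123440 | 32767 | 523 | 0.998563 | 1 | 1 |
| 123441 | 32767 | 542 | 0.998803 | 90 | 4 |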


If this data were stored as CSV, the numeric columns would all default to 64-bit types when loaded. Parquet retains the more efficient types I specified when saving the data.
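
To illustrate the CSV default, here is a small sketch with made-up rows. The values and the `compact_dtypes` mapping are illustrative assumptions, chosen to mirror a few of the dtypes shown by `book_train.info()` further down. Reading CSV text yields 64-bit columns, and the narrower types have to be reapplied by hand:

```python
import io

import pandas as pd

# Three made-up rows with a few of the book columns (illustrative values only).
csv_text = (
    "time_id,seconds_in_bucket,bid_price1,bid_size1\n"
    "5,0,1.001422,3\n"
    "5,1,1.001422,3\n"
    "5,5,1.001422,3\n"
)

# read_csv infers 64-bit integer and float columns by default.
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv.dtypes)

# Narrower types, applied manually (assumed to match the parquet schema).
compact_dtypes = {
    "time_id": "int16",
    "seconds_in_bucket": "int16",
    "bid_price1": "float32",
    "bid_size1": "int32",
}
compact = from_csv.astype(compact_dtypes)

# Compare the bytes used by the column data, ignoring the index.
print(from_csv.memory_usage(index=False).sum(), "bytes as read from CSV")
print(compact.memory_usage(index=False).sum(), "bytes after downcasting")
```

With parquet this manual step is unnecessary, because the dtypes are stored in the file alongside the data.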

**Expect memory usage to spike to roughly double the final dataframe size while parquet loads a file. Consider loading your largest dataset first or using partitions to mitigate this.**
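
One way to act on that advice is to process one partition at a time and keep only a small result from each, so that at most one partition is held in memory at once. The sketch below is an assumption about how this might be organised, not part of the original workflow: `summarise_partitions` and its `reader` parameter are hypothetical names, and the row count and byte size stand in for whatever per-stock features are actually needed.

```python
import pandas as pd


def summarise_partitions(paths, reader=pd.read_parquet):
    """Load each path separately and keep only a small summary row for it.

    `reader` is any callable that maps a path to a DataFrame; it defaults to
    pd.read_parquet, which requires pyarrow or fastparquet to be installed.
    """
    rows = []
    for path in paths:
        df = reader(path)  # only this partition is loaded at this point
        rows.append(
            {
                "path": path,
                "n_rows": len(df),
                "bytes": int(df.memory_usage(deep=True).sum()),
            }
        )
        del df  # drop the reference before loading the next partition
    return pd.DataFrame(rows)


# Example usage (commented out because it needs the dataset and a parquet engine):
# import glob
# summary = summarise_partitions(sorted(glob.glob('book_train.parquet/stock_id=*')))
```

The trade-off is that anything needing all stocks at once has to be computed from the per-partition results rather than from one large dataframe.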

In [14]:
book_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   time_id            3 non-null      int16   
 1   seconds_in_bucket  3 non-null      int16   
 2   bid_price1         3 non-null      float32 
 3   ask_price1         3 non-null      float32 
 4   bid_price2         3 non-null      float32 
 5   ask_price2         3 non-null      float32 
 6   bid_size1          3 non-null      int32   
 7   ask_size1          3 non-null      int32   
 8   bid_size2          3 non-null      int32   
 9   ask_size2          3 non-null      int32   
 10  stock_id           3 non-null      category
dtypes: category(1), float32(4), int16(2), int32(4)
memory usage: 359.0 bytes


The one exception is the `stock_id` column, which has been converted to the category type because it is [the partition column](https://arrow.apache.org/docs/python/parquet.html#reading-from-partitioned-datasets). The parquet files in this dataset are all partitioned by `stock_id` so that it's not necessary to load the entire file at once. In fact, if you examine the parquet files you'll see that they are actually directories.

In [16]:
! ls ../input/optiver-realized-volatility-prediction/book_train.parquet | head -n 5

'head' is not recognized as an internal or external command,
operable program or batch file.
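
The shell command fails here because `head` is not available in the Windows command prompt. A portable alternative, sketched with the standard library only, is to list the first few entries from Python. The function name `list_partitions` is hypothetical, and the relative path in the usage comment is the one from the cell above, which may not exist on every machine.

```python
from pathlib import Path


def list_partitions(root, limit=5):
    """Return the names of the first `limit` sub-directories of `root`.

    Names are sorted as strings, so 'stock_id=10' sorts before 'stock_id=2'.
    """
    names = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    return names[:limit]


# Example usage (commented out because it needs the dataset on disk):
# list_partitions('../input/optiver-realized-volatility-prediction/book_train.parquet')
```

Each returned name has the form `stock_id=<value>`, which is how the partition value is encoded in the directory layout.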


Those are in turn also directories, which would be relevant if the data were partitioned by more than one column.

In [5]:
! ls ../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/

c439ef22282f412ba39e9137a3fdabac.parquet


In [6]:
book_train_0 = pd.read_parquet('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=0/c439ef22282f412ba39e9137a3fdabac.parquet')
book_train_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 917553 entries, 0 to 917552
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   time_id            917553 non-null  int16  
 1   seconds_in_bucket  917553 non-null  int16  
 2   bid_price1         917553 non-null  float32
 3   ask_price1         917553 non-null  float32
 4   bid_price2         917553 non-null  float32
 5   ask_price2         917553 non-null  float32
 6   bid_size1          917553 non-null  int32  
 7   ask_size1          917553 non-null  int32  
 8   bid_size2          917553 non-null  int32  
 9   ask_size2          917553 non-null  int32  
dtypes: float32(4), int16(2), int32(4)
memory usage: 31.5 MB


Note that because we loaded a single partition file directly, **the partition column was not included**. If we need the stock ID, we could add it back manually, or load a larger subset of the data by passing a list of paths. The pattern below matches every stock ID whose directory name starts with `11` (that is, 11 and 110-119, where present), reducing memory usage without implicitly dropping the partition column:

In [7]:
import glob
subset_paths = glob.glob('../input/optiver-realized-volatility-prediction/book_train.parquet/stock_id=11*/*')
book_train_subset = pd.read_parquet(subset_paths)
book_train_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14006182 entries, 0 to 14006181
Data columns (total 11 columns):
 #   Column             Dtype   
---  ------             -----   
 0   time_id            int16   
 1   seconds_in_bucket  int16   
 2   bid_price1         float32 
 3   ask_price1         float32 
 4   bid_price2         float32 
 5   ask_price2         float32 
 6   bid_size1          int32   
 7   ask_size1          int32   
 8   bid_size2          int32   
 9   ask_size2          int32   
 10  stock_id           category
dtypes: category(1), float32(4), int16(2), int32(4)
memory usage: 494.2 MB
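
For the single-file case mentioned earlier, the stock ID can be restored manually from the path. The sketch below is one possible approach, not the method used to build this dataset: `add_partition_column` is a hypothetical helper, the small dataframe is a stand-in for a loaded partition, and keeping the value as a string category is an assumption (the parquet engine may produce a different category type).

```python
import re

import pandas as pd


def add_partition_column(df, path, key="stock_id"):
    """Return a copy of `df` with the `key=value` part of `path` added as a category column."""
    # Capture everything after 'key=' up to the next path separator.
    match = re.search(rf"{re.escape(key)}=([^/\\]+)", str(path))
    if match is None:
        raise ValueError(f"no '{key}=...' component in {path!r}")
    out = df.copy()
    out[key] = match.group(1)  # same value for every row, kept as a string
    out[key] = out[key].astype("category")
    return out


# Stand-in for a dataframe loaded from a single partition file (made-up values).
partition = pd.DataFrame({"time_id": [5, 5], "seconds_in_bucket": [0, 1]})
restored = add_partition_column(partition, "book_train.parquet/stock_id=0/part.parquet")
print(restored.dtypes)
```

The original dataframe is left unchanged, so the helper can be applied to each partition before concatenating them.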
