# Parquet Files (Optional)

As we mentioned in <a href="Data File Types.ipynb">Data File Types</a>, `.parquet` files are a useful data file type. They can help improve query speed, decrease memory requirements, and speed up column-based calculations.

## What we will accomplish

In this notebook we will:
- Dive a little deeper into the `.parquet` file type,
- Learn a bit more about how `.parquet` files improve query performance,
- Talk about the `pyarrow` package, and
- Demonstrate concepts with an actual `.parquet` file.

In [None]:
import pandas as pd

In this notebook we will use a fictional data set, `study_results`, that contains the results of a study examining the effectiveness of three treatments on reducing recovery time in adults. We can load the `.csv` version of these data with `read_csv`. We will look at a parquet equivalent of this file later in the notebook.

In [None]:
df_csv = pd.read_csv("../data/study_results.csv")

df_csv.head()

In [None]:
len(df_csv)

## Row format compared to columnar format

Recall that file types like `.csv`s and `.tsv`s store data in a row-based format like so:

<img src="row_based.png" width="40%"></img>

and `.parquet` files store their data in a columnar format like so:

<img src="columnar.png" width="60%"></img>
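To make the two layouts concrete, here is a minimal pure-Python sketch. The toy records and column names are made up for illustration, and flat lists only stand in for the order in which values would be laid out on disk; real `.parquet` files add encoding, compression, and metadata on top of this idea.

```python
# Toy table with three records; the column names and values are hypothetical.
rows = [
    {"id": 1, "treatment": "A", "recovery_days": 12},
    {"id": 2, "treatment": "B", "recovery_days": 9},
    {"id": 3, "treatment": "C", "recovery_days": 15},
]

# Row-based layout: values are stored one record after another.
row_layout = [value for row in rows for value in row.values()]

# Columnar layout: values are stored one column after another.
columns = {name: [row[name] for row in rows] for name in rows[0]}
column_layout = [value for values in columns.values() for value in values]

print(row_layout)
print(column_layout)
```

In the columnar layout all values of a single column sit next to each other, which is what makes pulling out one column cheap.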

When we load a `.parquet` file with `pandas`, this columnar format is hidden from us, in a sense, because `read_parquet` automatically converts the file into a `pandas` `DataFrame`, which we work with as an ordinary table of rows and columns. However, the columnar approach gives `.parquet` files their most desirable quality for data analysts and scientists: increased column querying speed.

### Increasing column query speed

Many real-world applications have data sets with many columns (think hundreds or more), but you will commonly want only a few of them for any given analysis or application. Row-based formats are slow to subselect a set of columns because you need to traverse each row (of which there may be millions) and pick out the appropriate values. By contrast, with columnar formats you only need to traverse the set of columns and choose the ones you want. This is much faster because there are typically far fewer columns than rows.

## Actually directories

Up to this point we have thought of parquet files as single files. While there are distinct `.parquet` files, it is more common in practice to deal with the parquet format as a directory of `.parquet` files.

### Partitioning

For a given data set a single `.parquet` file tends to only store a subset of the data. This subset is formed through partitioning. For example, the study data are split according to sex and then according to treatment group as described in the schematic below.


<img src="partition.png" width="20%"></img>

We can implement such a partitioning using `to_parquet` along with the `partition_cols` argument, which takes in a `list` of column names along which the data are partitioned. Let's do that now with our `study_results.csv` data.

In [None]:
## Note that when we do this we are naming a directory
## NOT a file, so there is no .parquet extension at the end
## of the name
df_csv.to_parquet("../data/study_data_parquet",
                  partition_cols=["sex", "treatment"])

Now if you go to the `data` folder you will see the folder `study_data_parquet` in there. Inside that folder are `sex=M` and `sex=F` folders, each of which contains `treatment=A`, `treatment=B`, and `treatment=C` folders, as demonstrated in the sequential images below:

In the `data` folder:
<img src="study_parq_folder_1.png" width="90%"></img>

In the `data/study_data_parquet` folder:
<img src="study_parq_folder_2.png" width="90%"></img>

In the `data/study_data_parquet/sex=F` folder:
<img src="study_parq_folder_3.png" width="90%"></img>

In the `data/study_data_parquet/sex=F/treatment=A` folder:
<img src="study_parq_folder_4.png" width="90%"></img>

Even though the parquet "file" is a directory you can still read it in using `read_parquet`.

In [None]:
df_parq = pd.read_parquet("../data/study_data_parquet")

In [None]:
df_parq

However, as we can see, this loads the entire data set as a single table, which may be undesirable if the data set is quite large. Luckily, there are alternative ways to load a parquet file in Python.

## `pyarrow`

One way to take greater advantage of the benefits of parquet is by using the `pyarrow` package, <a href="https://arrow.apache.org/docs/python/parquet.html">https://arrow.apache.org/docs/python/parquet.html</a>, directly. First we need to import `parquet` from `pyarrow`.

In [None]:
## Import parquet as pq
import pyarrow.parquet as pq


Next we can load the parquet directory with `pq.ParquetDataset`.

In [None]:
## You can read in a directory with
## pq.ParquetDataset(directory name)
study_directory = pq.ParquetDataset("../data/study_data_parquet/")

We can see how the parquet directory was partitioned using `.partitioning.dictionaries`.

You can get the data using the `.read()` function. You can then turn it into a `pandas` `DataFrame` using `.to_pandas()`.

In [None]:
## Creating a pyarrow Table object
study_directory.read()


In [None]:
## Turning a pyarrow table into a dataframe
study_directory.read().to_pandas()


If we want to filter the data before it is loaded, we can add a `filters` argument to `pq.ParquetDataset`.

In [None]:
## filters takes in a list of tuples
## each tuple has a logic condition
## (the column string, the logical comparison, the value for comparison)
## conditions in the same list are combined with AND
study_directory_F = pq.ParquetDataset("../data/study_data_parquet/",
                                         filters=[("sex", "=", "F")])

In [None]:
## examine the filter in action
study_directory_F.read().to_pandas()


There are also ways to filter after you have loaded the directory, but we will touch on that in the corresponding practice problems notebook. If you are interested in learning more about how you can manage parquet files with `pyarrow`, check out their documentation:
- Goes directly to handling parquet files: <a href="https://arrow.apache.org/docs/python/parquet.html">https://arrow.apache.org/docs/python/parquet.html</a>,
- Starts at the beginning of the documentation: <a href="https://arrow.apache.org/docs/python/getstarted.html">https://arrow.apache.org/docs/python/getstarted.html</a>.

That will be it for this notebook!

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md)