# Parquet Files (Optional)

The questions in this notebook correspond to the material covered in <a href="../lectures/Parquet Files (Optional).ipynb">Parquet Files (Optional)</a>.

In [1]:
import pandas as pd
import numpy as np

1. For this notebook, imagine that you are part of a team studying how three different interventions affect the time it takes a participant to finish a standardized test. The code chunk below generates data and places it in a `DataFrame` called `df`. The four columns of `df` are: `sex`, the sex of the participant; `study_group`, which of the four study groups the participant was in; `age`, the age of the participant; and `test_time`, how long (in minutes) the participant took to complete the test. Run this code and feel free to examine the `DataFrame`.

In [2]:
np.random.seed(403)

sex = ["M"]*80
sex.extend(["F"]*80)
study_group = ["A", "B", "C", "D"]*40
age = np.random.randint(18, 35, 160)
test_time = np.random.randint(30, 80, 160)

df = pd.DataFrame({'sex': sex,
                   'study_group': study_group,
                   'age': age,
                   'test_time': test_time})

df

Unnamed: 0,sex,study_group,age,test_time
0,M,A,34,79
1,M,B,30,53
2,M,C,23,46
3,M,D,19,76
4,M,A,20,66
...,...,...,...,...
155,F,D,18,72
156,F,A,20,62
157,F,B,34,67
158,F,C,29,46


2. Save these data to a Parquet directory partitioned by `sex`, then by `study_group`.

In [3]:
df.to_parquet("../data/test_time_study/", 
              partition_cols=['sex', 'study_group'],
              index=False)

3. Load the Parquet directory using `pyarrow`. Examine the partitioning.

In [4]:
import pyarrow.parquet as pq

In [5]:
study_pq = pq.ParquetDataset("../data/test_time_study/")

In [6]:
study_pq.partitioning.dictionaries

[<pyarrow.lib.StringArray object at 0x1188435e0>
 [
   "F",
   "M"
 ],
 <pyarrow.lib.StringArray object at 0x1188437c0>
 [
   "A",
   "B",
   "C",
   "D"
 ]]

4. Load a filtered version of the directory that includes only male participants over the age of 25.

In [7]:
study_pq_filt = pq.ParquetDataset("../data/test_time_study/",
                                  filters=[('sex', '=', 'M'),
                                           ('age', '>', 25)])

print(study_pq_filt.read().to_pandas().sex.value_counts())

study_pq_filt.read().to_pandas().describe()

M    132
F      0
Name: sex, dtype: int64


Unnamed: 0,age,test_time
count,132.0,132.0
mean,28.886364,55.545455
std,2.631714,14.172527
min,26.0,30.0
25%,27.0,45.25
50%,28.0,55.5
75%,31.0,67.25
max,34.0,79.0


5. You can also perform a query <i>after</i> loading the directory by building an expression with `pyarrow.compute` and passing it to <a href="https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.filter">`Table.filter`</a>. Read through the commented code below to learn how, then try to write a query that returns all male participants older than 25.

In [8]:
import pyarrow.compute as pc

In [9]:
## This defines a logical expression for the query
## pc.field specifies which column you want to use for the expression
expr = pc.field("sex") == "F"

## This filters the table
## study_pq.read() reads in the file as a table to be filtered
study_pq_F = study_pq.read().filter(expr).to_pandas()


study_pq_F.sex.value_counts()

F    240
M      0
Name: sex, dtype: int64

In [10]:
study_pq.read().filter((pc.field("sex") == "M") & (pc.field("age") > 25)).to_pandas()

Unnamed: 0,age,test_time,sex,study_group
0,34,79,M,A
1,27,41,M,A
2,32,64,M,A
3,26,58,M,A
4,33,51,M,A
...,...,...,...,...
127,26,54,M,D
128,28,64,M,D
129,32,63,M,D
130,27,52,M,D


--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).