---
title: Reading Parquet Datasets using Python
description: A guide to reading Parquet datasets in Python using PyArrow, Pandas, and Polars.
author: GSI Environmental Inc.
date: 2025-01-22
code-fold: false
execute:
  freeze: true
---

This page describes how to read [Parquet](https://parquet.apache.org/) datasets in Python using PyArrow, Pandas, and Polars. It does *not* cover how to analyze the data once it has been loaded. For that, see each library's documentation:

- [Pandas](https://pandas.pydata.org/docs/)
- [Polars](https://pola-rs.github.io/polars-book/)
- [PyArrow](https://arrow.apache.org/docs/python/)

## Python Setup

1. Install Python version 3.10, 3.11, or 3.12.
2. Install required dependencies: `python -m pip install jupyter pyarrow polars pandas`
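
To confirm that the environment matches the two steps above, a quick check like the following can be run. It is a minimal sketch using only the standard library; the helper names `installed_versions` and `python_supported` are illustrative and not part of any of the libraries.

```python
import sys
from importlib import metadata

# Packages named in the install command above.
REQUIRED = ["jupyter", "pyarrow", "polars", "pandas"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def python_supported(version_info=sys.version_info):
    """True when the interpreter is one of the versions listed above."""
    return (version_info[0], version_info[1]) in {(3, 10), (3, 11), (3, 12)}


print(f"Python {sys.version_info[0]}.{sys.version_info[1]} supported: {python_supported()}")
for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```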

## Import Libraries

In [43]:
# Standard libraries
import logging
from pathlib import Path

# Third party libraries
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

## Configure Logging

In [44]:
logging.basicConfig(
    level=logging.INFO, format="%(message)s"
)
logger = logging.getLogger(__name__)

## Define Variables

In [45]:
# Path to the parent folder containing the Parquet dataset folder.
# If the dataset is in the current working directory, this can be set to ".".
DATA_DIR = Path("data")
# Name of the dataset folder (e.g., "Bottle", "CTD", "Mooring", "Species_Abundance").
DATASET = Path("Bottle")

# Full path to the dataset folder.
DATASET_DIR = DATA_DIR / DATASET
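
Before reading anything, it can help to confirm that the dataset folder resolves to what is expected. The following is a small sketch using only `pathlib`. It assumes the data files carry a `.parquet` extension, which is not stated above, and the helper name `list_parquet_files` is illustrative.

```python
from pathlib import Path


def list_parquet_files(dataset_dir):
    """Return the sorted Parquet data files found under dataset_dir."""
    dataset_dir = Path(dataset_dir)
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"Dataset folder not found: {dataset_dir}")
    # rglob also descends into partition subfolders, if the dataset has any.
    # Assumption: data files use the ".parquet" extension.
    return sorted(dataset_dir.rglob("*.parquet"))


# Same layout as the variables defined above.
DATASET_DIR = Path("data") / "Bottle"
try:
    files = list_parquet_files(DATASET_DIR)
    print(f"Found {len(files)} Parquet files in {DATASET_DIR}")
    print(f"_metadata present: {(DATASET_DIR / '_metadata').is_file()}")
except FileNotFoundError as exc:
    print(exc)
```

If the folder is missing, the message printed by the `except` branch usually points to a wrong `DATA_DIR` or working directory.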

## Reading Dataset Metadata

In [46]:
meta = pq.read_metadata(DATASET_DIR / "_metadata")
logger.info(f"Parquet format version: {meta.format_version}")
logger.info(f"Columns: {meta.num_columns}")
logger.info(f"Row groups: {meta.num_row_groups}")
logger.info(f"Rows: {meta.num_rows}")

for key, value in meta.metadata.items():
    if "schema" not in key.decode("utf-8"):
        logger.info(f"{key.decode('utf-8')}: {value.decode('utf-8')}")

Parquet format version: 2.6
Columns: 55
Row groups: 50
Rows: 100500
created_at: 2025-01-22T12:17:58.319666
author: GSI Environmental Inc.
dataset_name: Bottle


## Reading the Dataset with PyArrow

[PyArrow reference](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html)

In [47]:
ds = pq.ParquetDataset(DATASET_DIR)

logger.info(f"Files: {len(ds.files)}")
columns = ds.schema.names
dtypes = ds.schema.types
column_dtypes = dict(zip(columns, dtypes))
logger.info("Columns:")
for column, dtype in column_dtypes.items():
    logger.info(f"  {str(column):<20}: {dtype}")

Files: 27
Columns:
  area_id             : string
  location_id         : string
  loc_desc            : string
  loc_type            : string
  loc_geom            : string
  x_coord             : double
  y_coord             : double
  srid                : int64
  coord_sys           : string
  loc_method          : string
  provider            : string
  study_id            : string
  study_name          : string
  sample_doc          : string
  sample_date         : timestamp[us]
  coll_scheme         : string
  sample_material     : string
  sample_id           : string
  sample_desc         : string
  original_sample_id  : string
  upper_depth         : double
  lower_depth         : double
  depth_units         : string
  split_type          : string
  sample_no           : string
  lab                 : string
  lab_pkg             : string
  material            : string
  material_analyzed   : string
  labsample           : string
  lab_rep             : string
  method_code 

In [48]:
# Read the entire dataset into a pyarrow.Table
data = ds.read()
# Alternatively, read only the columns you need (this overwrites the table read above)
data = ds.read(columns=["area_id", "location_id", "upper_depth", "lower_depth", "depth_units", "sample_date", "analyte", "result", "units"])
# Show the data
data

pyarrow.Table
area_id: string
location_id: string
upper_depth: double
lower_depth: double
depth_units: string
sample_date: timestamp[us]
analyte: string
result: double
units: string
----
area_id: [["Puget Sound: Main Basin","Puget Sound: Main Basin","Puget Sound: Main Basin","Puget Sound: Main Basin","Puget Sound: Main Basin",...,"Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin"],["Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin","Puget Sound: Whidbey Basin",...,"Puget Sound: Hood Canal","Puget Sound: Hood Canal","Puget Sound: Hood Canal","Puget Sound: Hood Canal","Puget Sound: Hood Canal"],...,["Puget Sound: Admiralty Inlet","Puget Sound: Admiralty Inlet","Puget Sound: Admiralty Inlet","Puget Sound: Admiralty Inlet","Puget Sound: Admiralty Inlet",...,"Puget Sound: Admiralty Inlet","Puget Sound: Admiralty Inlet","Puget Sound: A

In [49]:
# Optionally, convert to a pandas DataFrame
df = data.to_pandas()
# Show the first few rows
df.head()

|   | area_id | location_id | upper_depth | lower_depth | depth_units | sample_date | analyte | result | units |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Nitrate | 11.77 | uM |
| 1 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Nitrite | 0.63 | uM |
| 2 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Phosphate | 1.41 | uM |
| 3 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Ammonium | 2.56 | uM |
| 4 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Chlorophyll | 2.5 | ug/L |


## Reading the Dataset with Pandas

[Pandas reference](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html#pandas-read-parquet)

In [50]:
# Read the entire dataset into a pandas DataFrame
df = pd.read_parquet(DATASET_DIR, engine="pyarrow")
# Alternatively, read only the columns you need (this overwrites the DataFrame read above)
df = pd.read_parquet(DATASET_DIR, columns=["area_id", "location_id", "upper_depth", "lower_depth", "depth_units", "sample_date", "analyte", "result", "units"], engine="pyarrow")
# Show the first few rows
df.head()

|   | area_id | location_id | upper_depth | lower_depth | depth_units | sample_date | analyte | result | units |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Nitrate | 11.77 | uM |
| 1 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Nitrite | 0.63 | uM |
| 2 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Phosphate | 1.41 | uM |
| 3 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Ammonium | 2.56 | uM |
| 4 | Puget Sound: Main Basin | UW_P31 | 2.12 | 2.12 | m | 1999-06-15 08:39:43 | Chlorophyll | 2.5 | ug/L |


## Reading the Dataset with Polars

[Polars reference](https://docs.pola.rs/api/python/dev/reference/api/polars.read_parquet.html)

In [51]:
# Read the entire dataset
df = pl.read_parquet(DATASET_DIR, use_pyarrow=True)
# Keep only the columns of interest from the DataFrame that was just read
df = df.select(["area_id", "location_id", "upper_depth", "lower_depth", "depth_units", "sample_date", "analyte", "result", "units"])
# Show the first few rows
df.head()

| area_id | location_id | upper_depth | lower_depth | depth_units | sample_date | analyte | result | units |
|---|---|---|---|---|---|---|---|---|
| str | str | f64 | f64 | str | datetime[μs] | str | f64 | str |
| "Puget Sound: Main Basin" | "UW_P31" | 2.12 | 2.12 | "m" | 1999-06-15 08:39:43 | "Nitrate" | 11.77 | "uM" |
| "Puget Sound: Main Basin" | "UW_P31" | 2.12 | 2.12 | "m" | 1999-06-15 08:39:43 | "Nitrite" | 0.63 | "uM" |
| "Puget Sound: Main Basin" | "UW_P31" | 2.12 | 2.12 | "m" | 1999-06-15 08:39:43 | "Phosphate" | 1.41 | "uM" |
| "Puget Sound: Main Basin" | "UW_P31" | 2.12 | 2.12 | "m" | 1999-06-15 08:39:43 | "Ammonium" | 2.56 | "uM" |
| "Puget Sound: Main Basin" | "UW_P31" | 2.12 | 2.12 | "m" | 1999-06-15 08:39:43 | "Chlorophyll" | 2.5 | "ug/L" |
