ParquetSharp.Dataset

This is a work in progress and is not yet ready for public use.

ParquetSharp.Dataset supports reading datasets consisting of multiple Parquet files, which may be partitioned with a partitioning strategy such as Hive partitioning. Data is read using the Apache Arrow format.

Note that ParquetSharp.Dataset does not use the Apache Arrow C++ Dataset library, but is implemented on top of ParquetSharp, which uses the Apache Arrow C++ Parquet library.

Usage

To begin with, you will need a dataset of Parquet files that have the same schema:

/my-dataset/data0.parquet
/my-dataset/data1.parquet

You can then create a DatasetReader and read data from it as a stream of Arrow RecordBatch objects:

using ParquetSharp.Dataset;

var dataset = new DatasetReader("/my-dataset");
using var reader = dataset.ToBatches();
while (await reader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // Use data in the batch
    }
}

Your dataset may be partitioned using Hive partitioning, where each directory name contains a field name and value:

/my-dataset/part=a/data0.parquet
/my-dataset/part=a/data1.parquet
/my-dataset/part=b/data0.parquet
/my-dataset/part=b/data1.parquet
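
To make the naming convention concrete, here is a minimal sketch of how field values could be recovered from such a path. It is an illustration only and not how ParquetSharp.Dataset is implemented; `HivePathSketch` and `ParsePartitionValues` are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ParquetSharp.Dataset.
static class HivePathSketch
{
    // Extracts "name=value" pairs from the directory segments of a path
    // given relative to the dataset root.
    public static Dictionary<string, string> ParsePartitionValues(string relativePath)
    {
        var values = new Dictionary<string, string>();
        var segments = relativePath.Split('/', StringSplitOptions.RemoveEmptyEntries);
        // The last segment is the file name, so only directories are inspected.
        for (var i = 0; i < segments.Length - 1; ++i)
        {
            var separator = segments[i].IndexOf('=');
            if (separator > 0)
            {
                var name = segments[i].Substring(0, separator);
                values[name] = segments[i].Substring(separator + 1);
            }
        }
        return values;
    }
}
```

For the path `part=a/data0.parquet`, this returns a single entry mapping the field `part` to the value `a`.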

To read Hive partitioned data, you can provide a HivePartitioning.Factory instance to the DatasetReader constructor, and the partitioning schema will be inferred by looking at the dataset directory structure:

var partitioningFactory = new HivePartitioning.Factory();
var dataset = new DatasetReader("/my-dataset", partitioningFactory);

Alternatively, you can specify the partitioning schema explicitly:

using Apache.Arrow;
using Apache.Arrow.Types;

var partitioningSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Build();
var partitioning = new HivePartitioning(partitioningSchema);
var dataset = new DatasetReader("/my-dataset", partitioning);

When creating a DatasetReader, the schema from the first Parquet file found will be inspected to determine the full dataset schema. This can be avoided by providing the full dataset schema explicitly:

var datasetSchema = new Apache.Arrow.Schema.Builder()
    .Field(new Field("part", new StringType(), nullable: false))
    .Field(new Field("x", new Int32Type(), nullable: false))
    .Field(new Field("y", new FloatType(), nullable: false))
    .Build();
var dataset = new DatasetReader("/my-dataset", partitioning, datasetSchema);

Filtering data

When reading data from a dataset, you can specify the columns to include and filter rows based on field values. Row filters may apply to fields from data files or from the partitioning schema. When a filter excludes a partition directory, no files from that directory will be read.

var columns = new[] {"x", "y"};
var filter = Col.Named("part").IsIn(new[] {"a", "c"});
using var reader = dataset.ToBatches(filter, columns);
while (await reader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // batch will only contain columns "x" and "y",
        // and only files in the selected partitions will be read.
    }
}
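
The effect of a partition filter on which files get read can be sketched independently of the library. The code below is an assumption-laden illustration, not the library's implementation; `PruningSketch` and `SelectFiles` are hypothetical names, and the sketch only handles a single "is in" condition on one partition field.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ParquetSharp.Dataset.
static class PruningSketch
{
    // Keeps only the paths that sit under a "field=value" directory
    // whose value is in the allowed set.
    public static IEnumerable<string> SelectFiles(
        IEnumerable<string> paths, string field, ISet<string> allowedValues)
    {
        var prefix = field + "=";
        return paths.Where(path =>
        {
            var segments = path.Split('/', StringSplitOptions.RemoveEmptyEntries);
            // Skip the final segment, which is the file name.
            return segments
                .Take(segments.Length - 1)
                .Any(segment =>
                    segment.StartsWith(prefix, StringComparison.Ordinal) &&
                    allowedValues.Contains(segment.Substring(prefix.Length)));
        });
    }
}
```

Applied to the four example paths from the Hive-partitioned layout with the field `part` and allowed values `a` and `c`, only the two files under `part=a` are selected, since no `part=c` directory exists.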
