# CSV to Parquet

In this demo we'll walk through querying a CSV file from an AWS S3 bucket and saving the results locally as a Parquet file.

Create a Dask Client (`client`) for a cluster of your local GPUs, then pass it to BlazingContext (`bc`) at initialization to enable distributed query execution with BlazingSQL.

In [1]:
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

from dask.distributed import Client
client = Client(cluster)

from blazingsql import BlazingContext
bc = BlazingContext(dask_client=client, network_interface='lo')

BlazingContext ready


Register a public AWS S3 bucket and create a table (`taxi`) from it.

In [2]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

col_names = ['key', 'fare', 'pickup_x', 'pickup_y', 'dropoff_x', 'dropoff_y', 'passenger_count']
bc.create_table('taxi', 's3://blazingsql-colab/taxi_data/taxi_00.csv', names=col_names)

<pyblazing.apiv2.context.BlazingTable at 0x7f462dacf750>
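Passing `names=col_names` suggests the CSV has no header row, although the notebook does not say so explicitly. As a rough illustration of what assigning names to headerless columns means, here is a minimal sketch using Python's standard `csv` module rather than BlazingSQL's own parser. The single data row is copied from the query output shown later in this notebook.

```python
import csv
import io

col_names = ['key', 'fare', 'pickup_x', 'pickup_y', 'dropoff_x', 'dropoff_y', 'passenger_count']

# One headerless row (assumption: the real file is laid out like this).
raw = io.StringIO(
    '2012-02-02 22:30:19.0000002,8.9,-73.988703,40.758803,-73.986517,40.737205,1\n'
)

# With fieldnames supplied, DictReader treats the first line as data, not as a header.
rows = list(csv.DictReader(raw, fieldnames=col_names))

print(rows[0]['fare'])             # 8.9
print(rows[0]['passenger_count'])  # 1
```

The `csv` module returns every value as a string; type inference in BlazingSQL is a separate step that this sketch does not model.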

Store the path of the local directory where results will be saved as `data_dir`.

In [3]:
from os import getcwd
data_dir = getcwd().replace('/sample_use_cases', '/data')
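The `replace` call only changes the path when the current working directory contains `/sample_use_cases`; otherwise `data_dir` is simply the working directory. A minimal sketch of that behavior follows. The helper name `derive_data_dir` and both example paths are hypothetical and used only for illustration.

```python
def derive_data_dir(cwd: str) -> str:
    """Mirror the notebook's logic: swap the notebook folder for the data folder."""
    return cwd.replace('/sample_use_cases', '/data')

# Hypothetical working directories, for illustration only.
inside_repo = derive_data_dir('/home/user/repo/sample_use_cases')
elsewhere = derive_data_dir('/tmp/scratch')

print(inside_repo)  # /home/user/repo/data
print(elsewhere)    # /tmp/scratch (unchanged: nothing to replace)
```

If the notebook is launched from a different folder, it may be worth checking `data_dir` before writing to it.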

Because BlazingSQL returns the results of a distributed query as a `dask_cudf.DataFrame`, we can write those results directly with [.to_parquet()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_parquet).

In [4]:
bc.sql('SELECT * FROM taxi').to_parquet(f'{data_dir}/yellow_cab')
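Dask's `to_parquet` writes a directory of `part.N.parquet` files (typically one per partition) rather than a single file, so the number of files depends on how the result is partitioned. A minimal sketch for listing those parts in numeric order is shown below. The helper `list_parquet_parts` is hypothetical, and the demo uses empty placeholder files in a temporary directory instead of real Parquet output.

```python
import tempfile
from pathlib import Path

def list_parquet_parts(directory):
    """Return the part.N.parquet files in `directory`, ordered by N."""
    parts = Path(directory).glob('part.*.parquet')
    # Sort on the integer N so that part.10 comes after part.2.
    return sorted(parts, key=lambda p: int(p.name.split('.')[1]))

# Demo on a throwaway directory with empty placeholder files (not real Parquet data).
with tempfile.TemporaryDirectory() as tmp:
    for name in ('part.10.parquet', 'part.0.parquet', 'part.2.parquet'):
        (Path(tmp) / name).touch()
    names = [p.name for p in list_parquet_parts(tmp)]

print(names)  # ['part.0.parquet', 'part.2.parquet', 'part.10.parquet']
```

In this notebook the equivalent call would be `list_parquet_parts(f'{data_dir}/yellow_cab')`, which shows whether more than `part.0.parquet` was written.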

Create a table from the newly written file, then run a simple query and call `.compute()` to collect the result into a `cudf.DataFrame` for display.

In [5]:
bc.create_table('parquet_taxi', f'{data_dir}/yellow_cab/part.0.parquet')

bc.sql('select * from parquet_taxi').compute()

| | key | fare | pickup_x | pickup_y | dropoff_x | dropoff_y | passenger_count | index |
|---|---|---|---|---|---|---|---|---|
| 0 | 2012-02-02 22:30:19.0000002 | 8.9 | -73.988703 | 40.758803 | -73.986517 | 40.737205 | 1 | 0 |
| 1 | 2014-09-20 07:19:24.0000001 | 4.0 | -73.990208 | 40.746703 | -73.994729 | 40.750512 | 1 | 1 |
| 2 | 2013-02-23 07:18:05.0000001 | 5.5 | -74.016757 | 40.709438 | -74.009 | 40.719496 | 3 | 2 |
| 3 | 2015-04-18 23:49:27.0000009 | 13.5 | -74.002708 | 40.733730 | -73.98609924 | 40.73477554 | 1 | 3 |
| 4 | 2010-03-04 08:15:59.0000001 | 10.5 | -73.988356 | 40.737665 | -74.012459 | 40.713934 | 1 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4999995 | 2011-02-24 16:06:26.0000001 | 6.9 | -73.966542 | 40.804975 | -73.949043 | 40.804227 | 2 | 4999995 |
| 4999996 | 2009-09-22 19:20:22.0000009 | 9.7 | -73.980055 | 40.752535 | -74.006443 | 40.739613 | 1 | 4999996 |
| 4999997 | 2012-04-19 02:17:32.0000001 | 14.1 | -73.998508 | 40.745305 | -73.953184 | 40.799361 | 2 | 4999997 |
| 4999998 | 2012-06-08 11:09:47.0000006 | 3.3 | -73.953630 | 40.778797 | -73.946068 | 40.775552 | 1 | 4999998 |


You can find the Python script version of this Notebook at [/python_scripts/csv_to_parquet.py](python_scripts/csv_to_parquet.py).