# CSV to Parquet

In this demo we'll walk through querying a CSV file from an AWS S3 bucket and saving the results locally as a Parquet file.

Create a Dask Client (`client`) for your local GPUs and pass it to BlazingContext (`bc`) at initialization to enable distributed query execution with BlazingSQL.

In [None]:
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

from dask.distributed import Client
client = Client(cluster)

from blazingsql import BlazingContext
bc = BlazingContext(dask_client=client, network_interface='lo')

Register a public AWS S3 bucket and create a table (`taxi`) from it.

In [18]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/1_0_0.parquet')

<pyblazing.apiv2.context.BlazingTable at 0x7fb65c34a0d0>
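
The path passed to `create_table` is an S3-style URI, and in this demo the first component after `s3://` matches the name passed to `bc.s3`. As a minimal sketch that touches neither BlazingSQL nor the network, the standard library can split such a URI into its parts. `split_s3_uri` is a hypothetical helper introduced here for illustration, not part of BlazingSQL.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (name, key). Hypothetical helper, for illustration only."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URI: {uri!r}")
    # netloc is everything between 's3://' and the next '/'
    return parsed.netloc, parsed.path.lstrip("/")


name, key = split_s3_uri("s3://blazingsql-colab/yellow_taxi/1_0_0.parquet")
print(name)  # blazingsql-colab
print(key)   # yellow_taxi/1_0_0.parquet
```

A check like this can catch a mistyped scheme or registered name before the table is created.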

Store the path of the local directory where results will be saved in `data_dir`.

In [29]:
from os import getcwd
data_dir = getcwd().replace('/sample_use_cases', '/data')
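
Note that `str.replace` substitutes every occurrence of `/sample_use_cases` in the path, and returns the path unchanged when that substring is absent. A slightly more defensive sketch follows, assuming the notebook runs from a directory named `sample_use_cases` that sits next to `data`. The helper name `sibling_data_dir` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def sibling_data_dir(cwd, run_dir="sample_use_cases", data_dir="data"):
    """Return the sibling data directory of `cwd`, or raise if `cwd` is unexpected."""
    path = PurePosixPath(cwd)
    if path.name != run_dir:
        # Fail loudly instead of silently writing results somewhere unintended
        raise ValueError(f"expected to run from a '{run_dir}' directory, got {cwd!r}")
    # Swap only the final path component, leaving the parents untouched
    return str(path.with_name(data_dir))


print(sibling_data_dir("/home/user/repo/sample_use_cases"))  # /home/user/repo/data
```

In the notebook this would be called as `sibling_data_dir(getcwd())`.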

Because BlazingSQL returns the results of a distributed query as a dask_cudf.DataFrame, we can write those results directly to disk with [.to_parquet()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_parquet).

In [31]:
bc.sql('SELECT * FROM taxi').to_parquet(f'{data_dir}/sample_taxi')
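
Calling `.to_parquet()` on a Dask DataFrame writes a directory rather than a single file, with one file per partition named `part.0.parquet`, `part.1.parquet`, and so on by default. The sketch below uses only the standard library to order such file names by partition number. `sort_part_files` and the sample listing are illustrative assumptions; the number of parts actually written depends on how many partitions the query result has.

```python
import re

# Matches Dask's default per-partition file names, e.g. 'part.12.parquet'
_PART_RE = re.compile(r"^part\.(\d+)\.parquet$")


def sort_part_files(names):
    """Return only the part files from `names`, ordered by partition number."""
    numbered = []
    for name in names:
        match = _PART_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    # Sort numerically so 'part.10' comes after 'part.2'
    return [name for _, name in sorted(numbered)]


listing = ["part.10.parquet", "_metadata", "part.2.parquet", "part.0.parquet"]
print(sort_part_files(listing))
# ['part.0.parquet', 'part.2.parquet', 'part.10.parquet']
```

With a real output directory, `names` could come from `os.listdir(f'{data_dir}/sample_taxi')`.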

Create a table from the newly written file, then run a simple query and call `.compute()` to collect the results into a cudf.DataFrame for display.

In [23]:
bc.create_table('parquet_taxi', f'{data_dir}/sample_taxi/part.0.parquet')

bc.sql('select * from parquet_taxi').compute()

| | vendor_id | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | rate_code_id | store_and_fwd_flag | pickup_loc_id | dropoff_loc_id | payment_type | Fare_amount | Extra | MTA_tax | Improvement_surcharge | Tip_amount | Tolls_amount | Total_amount | index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-01-09 11:13:28 | 2017-01-09 11:25:45 | 1 | 3.300000 | 1 | N | 263 | 161 | 1 | 12.5 | 0.0 | 0.5 | 2.00 | 0.0 | 0.3 | 15.300000 | 0 |
| 1 | 1 | 2017-01-09 11:32:27 | 2017-01-09 11:36:01 | 1 | 0.900000 | 1 | N | 186 | 234 | 1 | 5.0 | 0.0 | 0.5 | 1.45 | 0.0 | 0.3 | 7.250000 | 1 |
| 2 | 1 | 2017-01-09 11:38:20 | 2017-01-09 11:42:05 | 1 | 1.100000 | 1 | N | 164 | 161 | 1 | 5.5 | 0.0 | 0.5 | 1.00 | 0.0 | 0.3 | 7.300000 | 2 |
| 3 | 1 | 2017-01-09 11:52:13 | 2017-01-09 11:57:36 | 1 | 1.100000 | 1 | N | 236 | 75 | 1 | 6.0 | 0.0 | 0.5 | 1.70 | 0.0 | 0.3 | 8.500000 | 3 |
| 4 | 2 | 2017-01-01 00:00:00 | 2017-01-01 00:00:00 | 1 | 0.020000 | 2 | N | 249 | 234 | 2 | 52.0 | 0.0 | 0.5 | 0.00 | 0.0 | 0.3 | 52.799999 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 18826826 | 1 | 2018-05-01 16:31:07 | 2018-05-01 16:43:07 | 0 | 1.300000 | 1 | N | 161 | 234 | 1 | 9.0 | 1.0 | 0.5 | 2.15 | 0.0 | 0.3 | 12.950000 | 18826826 |
| 18826827 | 1 | 2018-05-01 16:44:57 | 2018-05-01 17:11:43 | 0 | 4.300000 | 1 | N | 234 | 13 | 1 | 19.5 | 1.0 | 0.5 | 5.30 | 0.0 | 0.3 | 26.600000 | 18826827 |
| 18826828 | 2 | 2018-05-01 16:10:21 | 2018-05-01 16:27:40 | 2 | 4.020000 | 1 | N | 262 | 234 | 1 | 16.0 | 1.0 | 0.5 | 1.80 | 0.0 | 0.3 | 19.600000 | 18826828 |
| 18826829 | 2 | 2018-05-01 16:45:53 | 2018-05-01 17:39:44 | 1 | 17.700001 | 2 | N | 132 | 148 | 1 | 52.0 | 4.5 | 0.5 | 14.32 | 0.0 | 0.3 | 71.620003 | 18826829 |


You can find the Python script version of this Notebook at [/python_scripts/aws_s3_to_parquet.py](/python_scripts/aws_s3_to_parquet.py).