![Top <](./images/watsonxdata.png "watsonxdata")

# Convert CSV to Parquet

This sample code takes an input CSV file and converts it to Parquet format. At the same time, it generates the SQL required to catalog the table in the watsonx.data database.

Update the filenames below to reflect the CSV input file and the name of the Parquet file to generate. Note that this code assumes that the CSV file has a header row.

In [None]:
import pyarrow as pa
import pandas as pd

csv_in      = "/sampledata/csv/taxi/taxi.csv"
parquet_out = "/tmp/taxi.parquet"
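The code in this notebook assumes that the CSV file starts with a header row. If a file has no header, a minimal sketch along the following lines could supply column names when the file is read. The sample data and the column names `vendor_id`, `trip_distance` and `fare` are hypothetical and are not taken from the taxi data set.

```python
import io
import pandas as pd

# Hypothetical header-less CSV content, used here in place of a real file.
raw_csv = io.StringIO("1,2.5,12.0\n2,0.8,5.5\n")

# header=None tells pandas that the first row is data, not column names;
# names= supplies the (hypothetical) column names to use instead.
df_no_header = pd.read_csv(
    raw_csv,
    header=None,
    names=["vendor_id", "trip_distance", "fare"],
)

print(df_no_header.dtypes)
```

The resulting DataFrame can then be passed through the same type mapping and Parquet conversion as a file that has a header.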

Enter the details of the catalog, schema, table name and the location of the S3 bucket that contains the file.

In [None]:
catalog     = "hive_data"
schema      = "ontime"
table       = "ontime"
bucket      = "s3a://hive-bucket/ontime/ontime"
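The sample values above appear to follow the pattern `s3a://<bucket>/<schema>/<table>`. Assuming that layout (it is a convention inferred from the sample values, not a documented watsonx.data requirement), the location could be derived from its parts rather than typed by hand. The helper name `external_location` is hypothetical.

```python
def external_location(bucket_name: str, schema: str, table: str) -> str:
    """Build a location of the assumed form s3a://<bucket>/<schema>/<table>."""
    return f"s3a://{bucket_name}/{schema}/{table}"


# Reproduces the sample value used in this notebook.
print(external_location("hive-bucket", "ontime", "ontime"))
```

Deriving the path this way keeps the schema and table names in the location consistent with the names used in the generated SQL.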

This code makes some assumptions about the data type conversion that will be used when taking the CSV file and creating the parquet file. You may need to adjust the column types when the SQL is generated. Note that this code does not execute the SQL that is produced. The assumption is that you will take the generated SQL and run it in the watsonx.data UI.

In [None]:
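# Read the CSV file, treating "-" as a missing value,
# then replace every missing value with 0
dfValue = pd.read_csv(csv_in, na_values="-")
dfValue = dfValue.fillna(0)

columns = dict(dfValue.dtypes)
column_to_type = {}

# Map each pandas dtype to a type name that is used both for the
# Parquet conversion and for the generated SQL
for column in columns:

    datatype = str(columns[column]).upper()

    if datatype == "OBJECT":
        col_type = "string"
    elif datatype == "INT64":
        col_type = "int64"
    elif datatype == "FLOAT64":
        col_type = "double"
    elif "DATETIME64" in datatype:
        col_type = "timestamp"
    elif datatype == "BOOL":
        col_type = "boolean"
    else:
        col_type = "string"

    column_to_type[column] = col_type

# "timestamp" is not a pandas dtype name; datetime columns already have
# the right dtype, so they are left out of the cast
pandas_types = {column: t for column, t in column_to_type.items() if t != "timestamp"}
dfValue = dfValue.astype(pandas_types)
dfValue.to_parquet(parquet_out)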

# Build one '"name" type' definition per column, separated by commas
column_definitions = ",\n".join(
    f'"{key}" {column_to_type[key]}' for key in column_to_type
)

sql = f'CREATE TABLE IF NOT EXISTS "{catalog}"."{schema}"."{table}" (\n'
sql = sql + column_definitions
sql = sql + '\n)\n'
sql = sql + f"WITH (format='PARQUET',external_location='{bucket}');"
print(sql)
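The type names in the generated SQL double as pandas dtype names, which is one reason the column types may need adjusting before the statement is run. If the query engine does not accept names such as `string` or `int64`, a sketch like the following could translate pandas dtypes into Presto SQL type names (`varchar`, `bigint`, `double`, `timestamp`, `boolean`) instead. The mapping table and the function name `presto_type` are assumptions made for illustration and are not part of the original notebook.

```python
# Assumed mapping from pandas dtype names to Presto SQL type names.
PANDAS_TO_PRESTO = {
    "object": "varchar",
    "string": "varchar",
    "int64": "bigint",
    "float64": "double",
    "bool": "boolean",
    "boolean": "boolean",
}


def presto_type(pandas_dtype: str) -> str:
    """Return a Presto type name for a pandas dtype name (hypothetical helper)."""
    name = pandas_dtype.lower()
    # datetime64[ns], datetime64[ns, UTC], ... are all treated as timestamps
    if name.startswith("datetime64"):
        return "timestamp"
    # Anything not recognized falls back to a character type
    return PANDAS_TO_PRESTO.get(name, "varchar")


# Example: dtype names as they would appear in str(df.dtypes[column])
for dtype_name in ["object", "int64", "float64", "datetime64[ns]", "bool"]:
    print(dtype_name, "->", presto_type(dtype_name))
```

Keeping the SQL type names separate from the pandas dtype names lets the Parquet conversion and the `CREATE TABLE` statement each use names that are valid in their own context.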

We can double check to see that the parquet file has been produced.

In [None]:
%system ls -al {parquet_out}
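Listing the file only confirms that it exists. As a slightly stronger check, the sketch below tests whether a file begins and ends with the 4-byte Parquet magic number `PAR1`. This is a quick sanity check rather than a full validation of the file, and the function name `looks_like_parquet` is hypothetical.

```python
import os


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # A valid file needs room for at least the leading and trailing magic bytes
    if os.path.getsize(path) < 8:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

In this notebook it could be called as `looks_like_parquet(parquet_out)` once the conversion cell has run.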

#### Credits: IBM 2025, George Baklarz [baklarz@ca.ibm.com]