Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.Sign up
Empty values in comma-delimited data causes parsing issues #469
I have data files where each line contains a record of data delimited by only a comma. There are no commas within any of the contents of the data so there shouldn't be any escaping issues. The first 22 columns of the table look as follows:
CREATE TABLE trips ( trip_id UInt32, vendor_id String, pickup_datetime DateTime, dropoff_datetime Nullable(DateTime), store_and_fwd_flag Nullable(FixedString(1)), rate_code_id Nullable(UInt8), pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64), dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64), passenger_count Nullable(UInt8), trip_distance Nullable(Float64), fare_amount Nullable(Float32), extra Nullable(Float32), mta_tax Nullable(Float32), tip_amount Nullable(Float32), tolls_amount Nullable(Float32), ehail_fee Nullable(Float32), improvement_surcharge Nullable(Float32), total_amount Nullable(Float32), payment_type Nullable(String), trip_type Nullable(UInt8), ... ) ENGINE = Log;
When I attempt to import the data with the following command:
time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \ clickhouse-client \ --query="INSERT INTO trips FORMAT CSV")
I'm getting an error message on one of the records:
Is there a better way to define the table schema and/or the format parameters in the INSERT INTO statement so that the missing values don't cause such an issue when importing?
CSV format doesn't treat empty as NULL to avoid ambiguity with empty strings.
Please note, that NULLs support is still shallow. We use NULLs only to import data and to convert to non-Nullable types before any further data analysis.
I suggest to use TSV (TabSeparated) format. Postgres, MySQL, Hive, etc. print NULLs as
Thanks for the explanation. I'll use
Just in case it's helpful for anyone looking at this issue in the future this is the command I'm running:
time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \ sed 's/,/\t/g' | \ clickhouse-client \ --query="INSERT INTO trips FORMAT TSV")
If you could that would be great. In the mean time I've put together a Python script to transform the data:
import sys for line in sys.stdin: print ','.join([item if len(item.strip()) else '\N' for item in line.strip().split(',')])
I'm running the import now and it seems to be running well.
for filename in /home/mark/trips/trips_x*.csv.gz; do gunzip -c $filename | \ python trans.py | \ clickhouse-client \ --query="INSERT INTO trips FORMAT CSV" done
my python script