New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Empty values in comma-delimited data causes parsing issues #469
Comments
CSV format doesn't treat empty as NULL to avoid ambiguity with empty strings. Please note, that NULLs support is still shallow. We use NULLs only to import data and to convert to non-Nullable types before any further data analysis. I suggest to use TSV (TabSeparated) format. Postgres, MySQL, Hive, etc. print NULLs as |
Thanks for the explanation. I'll use Just in case it's helpful for anyone looking at this issue in the future this is the command I'm running: time (gunzip -c /home/mark/trips/trips_x*.csv.gz | \
sed 's/,/\t/g' | \
clickhouse-client \
--query="INSERT INTO trips FORMAT TSV") |
Looks like I'll need to transform the blanks to |
Yes. Although we can introduce option to treat empty as NULL rather quickly (few days). |
If you could that would be great. In the mean time I've put together a Python script to transform the data: import sys
for line in sys.stdin:
print ','.join([item if len(item.strip()) else '\N'
for item in line.strip().split(',')]) I'm running the import now and it seems to be running well. for filename in /home/mark/trips/trips_x*.csv.gz; do
gunzip -c $filename | \
python trans.py | \
clickhouse-client \
--query="INSERT INTO trips FORMAT CSV"
done |
This problem occurs specifically for floating point types (Nullable(Float64), Nullable(Float32)). I don't think there is any ambiguity what empty floating point value means. Also this problem occurs for version 1.1.54343, but not for 1.1.54310. |
my python script https://gist.github.com/anjia0532/6db48b0886d91d9a663e5a9fd19f2aaa src csv
python cmd dest csv
|
Resolved in #5625 |
I have data files where each line contains a record of data delimited by only a comma. There are no commas within any of the contents of the data so there shouldn't be any escaping issues. The first 22 columns of the table look as follows:
When I attempt to import the data with the following command:
I'm getting an error message on one of the records:
Is there a better way to define the table schema and/or the format parameters in the INSERT INTO statement so that the missing values don't cause such an issue when importing?
The text was updated successfully, but these errors were encountered: