Simple Awk-based SQL-like query language for CSVs.
This small project was created as a convenience tool to easily and quickly examine large CSV or ECSV files from ESA's Gaia space astrometry mission.
It aims at providing basic SQL-like capabilities directly from the command line, yet with a slightly peculiar syntax.
It was created before we knew about DuckDB and its amazing functionalities and performance on CSV files, so this tool is probably a bit useless now! Nevertheless, it's quite fun to see the kind of things that you can do just with shell scripting and Awk.
This tool doesn't do much error checks (if any at all), so you should double-check your commands and perhaps start with small tests before launching it on large files.
The main program is awkql.sh, and actually there's not much more than this (just the ancillary cols2nums.awk program).
It processes just one CSV file in single-thread.
Its usage is:
./awkql.sh select <'a , b , c...'|'count'|"\$0"> from <file.csv> where <condition>
Please note that the column names, either in the 'select' or in the 'where', must be surrounded by blank spaces. Otherwise the tool won't recognize them.
The output is written to the console, so you can redirect it to a file.
Some examples:
-
Just "describe" the file by listing its column names and also showing just the very first record:
./awkql.sh desc GaiaSource_005264-006601.csv -
Count the records from a CSV file that fulfil certain conditions:
./awkql.sh select 'count' from GaiaSource_005264-006601.csv where '(( astrometric_params_solved >= 31) && ( parallax_over_error >= 10))' -
Get some fields, apply operations on them, and just for records fulfilling certain condition - and store the output in a file:
./awkql.sh select 'phot_g_mean_mag + 5* (log( parallax )/log(10.0) + 1), bp_rp' from GaiaSource_005264-006601.csv where '(( astrometric_params_solved >= 31) && ( parallax > 0 ) && ( parallax_over_error >= 3))' > absmag.csv
Note how we carefully leave blank spaces around the column names, and also how we do NOT leave any space in at least one side of the multiplication (so that bash doesn't expand it as a wildcard).
-
Get all of the fields from records which have a specific column contained in a file:
./awkql.sh select "\$0" from GaiaSource_005264-006601.csv where '( source_id in sourcelist.txt )'