Extract Cardano on-chain data to Parquet files

Db-sync maps on-chain data for the Cardano blockchain to PostgreSQL.

For extended use cases, instead of querying the database directly, we would like to use a computer cluster to run such queries in parallel at MaxSpeed™.

We already provide an export to BigQuery on Google's cloud, but this comes at a cost.

Instead, this project lets you export Db-sync's tables to Parquet files, which can be queried locally on your machine or across many computers in your local network.

We use duckdb to output Parquet files which can be queried using duckdb or Spark SQL.

snapshot

get a db-sync snapshot from: https://update-cardano-mainnet.iohk.io/cardano-db-sync/index.html

install it in a PostgreSQL db
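
for example (a sketch; the file name below is a placeholder for whichever archive the index page currently lists, and the restore helper scripts/postgresql-setup.sh ships with the cardano-db-sync repository, so check its flags against the documentation of your db-sync version):

# placeholder name; pick the latest snapshot listed on the index page above
SNAPSHOT=db-sync-snapshot-schema-13-xxxxxxxx.tgz
wget https://update-cardano-mainnet.iohk.io/cardano-db-sync/${SNAPSHOT}
# restore it into PostgreSQL with the cardano-db-sync helper script
# (exact flags may differ between db-sync versions)
PGPASSFILE=.pgpass scripts/postgresql-setup.sh --restore-snapshot ${SNAPSHOT} ledger-state-dir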

PostgreSQL setup

create a database named e.g. "cardanomainnet13" and a user with the same name.
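
for example, with standard PostgreSQL tooling (a minimal sketch; adjust the password and role attributes to your setup):

sudo -u postgres psql <<'EOF'
CREATE ROLE cardanomainnet13 LOGIN PASSWORD 'change-me';
CREATE DATABASE cardanomainnet13 OWNER cardanomainnet13;
EOF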

also add the views from ext/data-analytics-bigquery.git/schema/views to the analytics schema.

first, change the target users in the SQL files in that directory:

for F in *.sql; do sed -i -e 's/db_sync_master/cardanomainnet13/g;s/db_sync_reader/cardanomainnet13/g' "$F"; done
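
then apply the view files, for example (assuming the database from above; add connection options such as -h and -U as needed, and create the analytics schema first if the files do not):

psql -d cardanomainnet13 -c "CREATE SCHEMA IF NOT EXISTS analytics;"
for F in *.sql; do psql -d cardanomainnet13 -f "$F"; done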

copy to Parquet files

then run the continuous export to Parquet files (create a .pgpass file or set the password in the PGPASSWORD environment variable):

export PGHOST=host
export PGPORT=5432
export PGDATABASE=cardanomainnet13
export PGUSER=cardanomainnet13
export PGPASSFILE=.pgpass
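
the .pgpass file uses the standard PostgreSQL format (hostname:port:database:username:password) and must be readable only by its owner, e.g.:

cat > .pgpass <<'EOF'
host:5432:cardanomainnet13:cardanomainnet13:your-password
EOF
chmod 600 .pgpass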

set the target path and run the export for one table at a time:

PARQUET_PATH=/usr/local/spark/data
TBL=block  # for example
DBFILE=${TBL}.ddb
INCREMENT=20
# create the DuckDB table from its schema definition
./duckdb ${DBFILE} < schema/${TBL}.sql
# export the view in epoch ranges: PostgreSQL -> CSV -> DuckDB
for E in `seq 0 ${INCREMENT} 500`; do
  echo "exporting ${TBL} from ${E} to $((E+${INCREMENT})) as CSV"
  psql -c "\\copy ( SELECT * FROM analytics.vw_bq_${TBL} WHERE epoch_no >= ${E} AND epoch_no < ${E} + ${INCREMENT} ) to '${TBL}_${E}.csv' csv;"
  ./duckdb -c "COPY ${TBL} FROM '${TBL}_${E}.csv';" ${DBFILE}
done
# write the whole table out as a single Snappy-compressed Parquet file
./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}

table tx_in_out

this table is very large, so we split it into pieces: first generate the CSV files with the loop above, then load groups of CSV files into separate Parquet files.

TBL=tx_in_out
DBFILE=${TBL}.ddb
INCREMENT=20
for E in `seq 0 ${INCREMENT} 500`; do
    # start from a fresh DuckDB database for each epoch group
    if [ -e $DBFILE ]; then rm -v $DBFILE; fi
    ./duckdb ${DBFILE} < schema/${TBL}.sql
    # load whichever CSV files exist in this epoch range
    for F in `seq $E 1 $((E+INCREMENT-1))`; do
        if [ -e ${TBL}_${F}.csv ]; then
            ./duckdb -c "COPY ${TBL} FROM '${TBL}_${F}.csv';" ${DBFILE}
        fi
    done
    # one Parquet file per epoch group
    ./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}/${TBL}_${E}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}
done

querying those files:

./duckdb -c "SELECT epoch_no, COUNT(*) FROM read_parquet('${PARQUET_PATH}/cardano_mainchain/tx_in_out/tx_in_out*.parquet') GROUP BY epoch_no ORDER BY epoch_no;"
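
the same aggregation with Spark SQL, assuming spark-sql is on your PATH and PARQUET_PATH is /usr/local/spark/data as above:

spark-sql -e 'SELECT epoch_no, COUNT(*) FROM parquet.`/usr/local/spark/data/cardano_mainchain/tx_in_out` GROUP BY epoch_no ORDER BY epoch_no;'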
