Extract Cardano on-chain data to Parquet files

Db-sync maps on-chain data for the Cardano blockchain to PostgreSQL.

For extended use cases, instead of querying the database directly, we would like to use a computer cluster to run such queries in parallel at MaxSpeed™.

We already provide an export to BigQuery on Google's cloud, but this comes at a cost.

Instead, this project lets you export Db-sync's tables to Parquet files, which can be queried locally on your machine or across many computers in your local network.

We use duckdb to output Parquet files which can be queried using duckdb or Spark SQL.

snapshot

get a db-sync snapshot from: https://update-cardano-mainnet.iohk.io/cardano-db-sync/index.html

install it in a PostgreSQL db
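
for example (a sketch; the file name below is a placeholder for whichever archive the index page currently lists, and the restore helper scripts/postgresql-setup.sh ships with the cardano-db-sync repository, so check its flags against the documentation of your db-sync version):

# placeholder name; pick the latest snapshot listed on the index page above
SNAPSHOT=db-sync-snapshot-schema-13-xxxxxxxx.tgz
wget https://update-cardano-mainnet.iohk.io/cardano-db-sync/${SNAPSHOT}
# restore it into PostgreSQL with the cardano-db-sync helper script
# (exact flags may differ between db-sync versions)
PGPASSFILE=.pgpass scripts/postgresql-setup.sh --restore-snapshot ${SNAPSHOT} ledger-state-dir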

PostgreSQL setup

create a database named e.g. "cardanomainnet13" and a user with the same name.
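
for example, with standard PostgreSQL tooling (a minimal sketch; adjust the password and role attributes to your setup):

sudo -u postgres psql <<'EOF'
CREATE ROLE cardanomainnet13 LOGIN PASSWORD 'change-me';
CREATE DATABASE cardanomainnet13 OWNER cardanomainnet13;
EOF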

also add the views from ext/data-analytics-bigquery.git/schema/views to the analytics schema.

first, change the target users in the SQL files in that directory:

for F in *.sql; do sed -i -e 's/db_sync_master/cardanomainnet13/g;s/db_sync_reader/cardanomainnet13/g' "$F"; done
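
then apply the view files, for example (assuming the database from above; add connection options such as -h and -U as needed, and create the analytics schema first if the files do not):

psql -d cardanomainnet13 -c "CREATE SCHEMA IF NOT EXISTS analytics;"
for F in *.sql; do psql -d cardanomainnet13 -f "$F"; done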

copy to Parquet files

then run the continuous export to Parquet files (create a .pgpass file or set the password in the PGPASSWORD environment variable):

export PGHOST=host
export PGPORT=5432
export PGDATABASE=cardanomainnet13
export PGUSER=cardanomainnet13
export PGPASSFILE=.pgpass
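
the .pgpass file uses the standard PostgreSQL format (hostname:port:database:username:password) and must be readable only by its owner, e.g.:

cat > .pgpass <<'EOF'
host:5432:cardanomainnet13:cardanomainnet13:your-password
EOF
chmod 600 .pgpass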

set the target path and run the export for one table at a time:

PARQUET_PATH=/usr/local/spark/data
TBL=block  # for example
DBFILE=${TBL}.ddb
INCREMENT=20
# create the DuckDB table from its schema definition
./duckdb ${DBFILE} < schema/${TBL}.sql
# export the view in epoch ranges: PostgreSQL -> CSV -> DuckDB
for E in `seq 0 ${INCREMENT} 500`; do
  echo "exporting ${TBL} from ${E} to $((E+${INCREMENT})) as CSV"
  psql -c "\\copy ( SELECT * FROM analytics.vw_bq_${TBL} WHERE epoch_no >= ${E} AND epoch_no < ${E} + ${INCREMENT} ) to '${TBL}_${E}.csv' csv;"
  ./duckdb -c "COPY ${TBL} FROM '${TBL}_${E}.csv';" ${DBFILE}
done
# write the whole table out as a single Snappy-compressed Parquet file
./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}

table tx_in_out

this table is very large, so we split it into pieces: first generate the CSV files with the loop above, then load groups of CSV files into separate Parquet files.

TBL=tx_in_out
DBFILE=${TBL}.ddb
INCREMENT=20
for E in `seq 0 ${INCREMENT} 500`; do
    # start from a fresh DuckDB database for each epoch group
    if [ -e $DBFILE ]; then rm -v $DBFILE; fi
    ./duckdb ${DBFILE} < schema/${TBL}.sql
    # load whichever CSV files exist in this epoch range
    for F in `seq $E 1 $((E+INCREMENT-1))`; do
        if [ -e ${TBL}_${F}.csv ]; then
            ./duckdb -c "COPY ${TBL} FROM '${TBL}_${F}.csv';" ${DBFILE}
        fi
    done
    # one Parquet file per epoch group
    ./duckdb -c "COPY ${TBL} TO '${PARQUET_PATH}/cardano_mainchain/${TBL}/${TBL}_${E}.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY);" ${DBFILE}
done

querying those files:

./duckdb -c "SELECT epoch_no, COUNT(*) FROM read_parquet('${PARQUET_PATH}/cardano_mainchain/tx_in_out/tx_in_out*.parquet') GROUP BY epoch_no ORDER BY epoch_no;"
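
the same aggregation with Spark SQL, assuming spark-sql is on your PATH and PARQUET_PATH is /usr/local/spark/data as above:

spark-sql -e 'SELECT epoch_no, COUNT(*) FROM parquet.`/usr/local/spark/data/cardano_mainchain/tx_in_out` GROUP BY epoch_no ORDER BY epoch_no;'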
