This project extracts Bitcoin blockchain data from Bitcoin Core via its RPC API using a Python ETL process and stores the data in Parquet files. The database schema is as follows:
| Schema | Field | Type | Description | Index |
|---|---|---|---|---|
| Blocks | height | int32 | Height of the block | Unique |
| | block_hash | string | Hash of the block | |
| | time | int64 | Timestamp of the block (Unix) | |
| | tx_count | int32 | Number of transactions in block | |
| Transactions | height | int32 | Block height | Non-Unique |
| | block_hash | string | Hash of the related block | |
| | txid | string | Transaction ID | |
| | is_coinbase | bool_ | Is the transaction a coinbase? | |
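Since the data is stored in Parquet, these schemas map naturally onto pyarrow types (the `bool_` name in the table matches pyarrow's type naming). A minimal sketch, assuming pyarrow is the Parquet writer:

```python
import pyarrow as pa

# Parquet schemas matching the table above (a sketch, assuming pyarrow is used)
BLOCKS_SCHEMA = pa.schema([
    ("height", pa.int32()),       # Height of the block (unique)
    ("block_hash", pa.string()),  # Hash of the block
    ("time", pa.int64()),         # Timestamp of the block (Unix)
    ("tx_count", pa.int32()),     # Number of transactions in block
])

TRANSACTIONS_SCHEMA = pa.schema([
    ("height", pa.int32()),       # Block height (non-unique)
    ("block_hash", pa.string()),  # Hash of the related block
    ("txid", pa.string()),        # Transaction ID
    ("is_coinbase", pa.bool_()),  # Is the transaction a coinbase?
])
```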
This guide assumes you're already running a full Bitcoin node on your machine and have properly configured the RPC API. If this is not the case, please follow these instructions first: https://bitcoin.org/en/full-node. Once your node is fully synced, you can start using this repo to generate your database.
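Configuring the RPC API typically means enabling the RPC server in your `bitcoin.conf`. A minimal sketch (the exact options depend on your setup; `rpcauth` is Bitcoin Core's recommended alternative to plain-text credentials):

```
# ~/.bitcoin/bitcoin.conf (a sketch; adjust to your setup)
server=1
rpcuser=your_rpc_username
rpcpassword=your_rpc_password
```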
Setting up a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

Run `deactivate` in your shell when you are done.
You need to set your RPC API credentials, `rpc_user` and `rpc_password`, for the code to work. Use the python-dotenv library and add them to a `.env` file in your root folder like this:

```
RPC_USER=your_rpc_username
RPC_PASSWORD=your_rpc_password
```
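As an illustration of how these credentials come into play, here is a minimal sketch of a JSON-RPC call against the node using python-dotenv and the requests library (the project's actual client code may differ):

```python
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # reads RPC_USER and RPC_PASSWORD from the .env file

# Minimal JSON-RPC call to the node; 8332 is the default mainnet RPC port
payload = {"jsonrpc": "1.0", "id": "check", "method": "getblockcount", "params": []}
response = requests.post(
    "http://127.0.0.1:8332",
    json=payload,
    auth=(os.environ["RPC_USER"], os.environ["RPC_PASSWORD"]),
)
print(response.json()["result"])  # prints the current block height
```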
Then it's important to run the following unit tests first, as they ensure the API is properly set up:

```bash
python -m unittest discover
```

The following commands execute the complete population of each dataset:
```bash
python src/blocks/populate_blocks.py
python src/transactions/populate_transactions.py
...
```

The same commands with an example use of their optional parameters:
```bash
# Selecting 'start' and 'end' block height
python src/blocks/populate_blocks.py --start 10000 --end 20000
```
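For reference, a hypothetical sketch of how such `--start`/`--end` options might be defined with argparse (illustrative only, not the repo's actual code):

```python
import argparse

# Hypothetical CLI definition for the populate scripts' optional parameters
parser = argparse.ArgumentParser(description="Populate the blocks dataset")
parser.add_argument("--start", type=int, default=0,
                    help="first block height to fetch")
parser.add_argument("--end", type=int, default=None,
                    help="last block height to fetch (default: chain tip)")
args = parser.parse_args()
print(args.start, args.end)
```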
The following commands execute the relevant Data Quality checks for each dataset:

```bash
python src/blocks/blocks_dq.py
python src/transactions/transactions_dq.py
...
```

You can run these files manually or set up an automated workflow using cron (on Linux).
For the cron job to work seamlessly, I prefer to use SSH. I add a configuration block to ~/.bashrc so the ssh-agent starts on each machine reboot and the agent, environment, and key paths stay properly set up (see the sketch below).
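The exact configuration is environment-specific, but a common pattern for ~/.bashrc looks like this (a sketch; the key path is an assumption):

```bash
# Start ssh-agent if it isn't already running and load the key (a sketch)
if [ -z "$SSH_AUTH_SOCK" ]; then
    eval "$(ssh-agent -s)" > /dev/null
    ssh-add "$HOME/.ssh/id_ed25519" 2> /dev/null
fi
```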
Cron can be set up as follows:

```bash
crontab -e
```

Line to add to cron for a scheduled midnight run:

```
0 0 * * * ~/Projects/database_from_Bitcoin_Core/workflow.sh
```

Example manual run:

```bash
~/Projects/database_from_Bitcoin_Core/workflow.sh
```
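For reference, a workflow.sh along these lines would tie the steps above together (a sketch based on the commands in this guide; adjust paths to your checkout):

```bash
#!/usr/bin/env bash
# Sketch of a workflow script combining the population and DQ steps above
set -euo pipefail

cd "$HOME/Projects/database_from_Bitcoin_Core"
source venv/bin/activate

python src/blocks/populate_blocks.py
python src/transactions/populate_transactions.py

python src/blocks/blocks_dq.py
python src/transactions/transactions_dq.py
```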