This project extracts Bitcoin blockchain data from Bitcoin Core via its RPC API using a Python ETL process and stores the data in Parquet files. The database schema is as follows:
| Schema | Field | Type | Description | Index |
|---|---|---|---|---|
| Blocks | height | int32 | Height of the block | Unique |
| | block_hash | string | Hash of the block | |
| | time | int64 | Timestamp of the block (Unix) | |
| | tx_count | int32 | Number of transactions in block | |
| Transactions | height | int32 | Block height | Non-Unique |
| | block_hash | string | Hash of the related block | |
| | txid | string | Transaction ID | |
| | is_coinbase | bool_ | Is the transaction a coinbase? | |
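Since the data is stored in Parquet, these schemas map naturally onto pyarrow types (the `bool_` name in the table matches pyarrow's type naming). A minimal sketch, assuming pyarrow is the Parquet writer:

```python
import pyarrow as pa

# Parquet schemas matching the table above (a sketch, assuming pyarrow is used)
BLOCKS_SCHEMA = pa.schema([
    ("height", pa.int32()),       # Height of the block (unique)
    ("block_hash", pa.string()),  # Hash of the block
    ("time", pa.int64()),         # Timestamp of the block (Unix)
    ("tx_count", pa.int32()),     # Number of transactions in block
])

TRANSACTIONS_SCHEMA = pa.schema([
    ("height", pa.int32()),       # Block height (non-unique)
    ("block_hash", pa.string()),  # Hash of the related block
    ("txid", pa.string()),        # Transaction ID
    ("is_coinbase", pa.bool_()),  # Is the transaction a coinbase?
])
```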
This guide assumes you're already running a full Bitcoin node on your machine and have properly configured the RPC API. If this is not the case, please follow these instructions first: https://bitcoin.org/en/full-node. Once your node is fully synced, you can start using this repo to generate your database.
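Configuring the RPC API typically means enabling the RPC server in your `bitcoin.conf`. A minimal sketch (the exact options depend on your setup; `rpcauth` is Bitcoin Core's recommended alternative to plain-text credentials):

```
# ~/.bitcoin/bitcoin.conf (a sketch; adjust to your setup)
server=1
rpcuser=your_rpc_username
rpcpassword=your_rpc_password
```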
Setting up a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
pip install -e .
```

Run `deactivate` in your shell when you are done.
You need to set your RPC API credentials, `rpc_user` and `rpc_password`, for the code to work. Use the python-dotenv library and add them to a `.env` file in your root folder like this:

```
RPC_USER=your_rpc_username
RPC_PASSWORD=your_rpc_password
```
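As an illustration of how these credentials come into play, here is a minimal sketch of a JSON-RPC call against the node using python-dotenv and the requests library (the project's actual client code may differ):

```python
import os

import requests
from dotenv import load_dotenv

load_dotenv()  # reads RPC_USER and RPC_PASSWORD from the .env file

# Minimal JSON-RPC call to the node; 8332 is the default mainnet RPC port
payload = {"jsonrpc": "1.0", "id": "check", "method": "getblockcount", "params": []}
response = requests.post(
    "http://127.0.0.1:8332",
    json=payload,
    auth=(os.environ["RPC_USER"], os.environ["RPC_PASSWORD"]),
)
print(response.json()["result"])  # prints the current block height
```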
Then it's important to run the following unit tests first, as they ensure the API is properly set up:

```bash
python -m unittest discover
```

The following commands execute the complete population of each dataset:
```bash
python src/blocks/populate_blocks.py
python src/transactions/populate_transactions.py
...
```

The same commands with an example use of their optional parameters:
```bash
# Selecting 'start' and 'end' block height
python src/blocks/populate_blocks.py --start 10000 --end 20000
```
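For reference, a hypothetical sketch of how such `--start`/`--end` options might be defined with argparse (illustrative only, not the repo's actual code):

```python
import argparse

# Hypothetical CLI definition for the populate scripts' optional parameters
parser = argparse.ArgumentParser(description="Populate the blocks dataset")
parser.add_argument("--start", type=int, default=0,
                    help="first block height to fetch")
parser.add_argument("--end", type=int, default=None,
                    help="last block height to fetch (default: chain tip)")
args = parser.parse_args()
print(args.start, args.end)
```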
The following commands execute the relevant Data Quality checks for each dataset:

```bash
python src/blocks/blocks_dq.py
python src/transactions/transactions_dq.py
...
```

You can run these files manually or set up an automated workflow using cron (on Linux).
For the cron job to work seamlessly, I prefer to use SSH. I add a configuration block to ~/.bashrc so the ssh-agent starts on each machine reboot and the agent, environment, and key paths stay properly set up (see the sketch below).
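The exact configuration is environment-specific, but a common pattern for ~/.bashrc looks like this (a sketch; the key path is an assumption):

```bash
# Start ssh-agent if it isn't already running and load the key (a sketch)
if [ -z "$SSH_AUTH_SOCK" ]; then
    eval "$(ssh-agent -s)" > /dev/null
    ssh-add "$HOME/.ssh/id_ed25519" 2> /dev/null
fi
```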
Cron can be set up as follows:

```bash
crontab -e
```

Line to add to cron for a scheduled midnight run:

```
0 0 * * * ~/Projects/database_from_Bitcoin_Core/workflow.sh
```

Example manual run:

```bash
~/Projects/database_from_Bitcoin_Core/workflow.sh
```
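For reference, a workflow.sh along these lines would tie the steps above together (a sketch based on the commands in this guide; adjust paths to your checkout):

```bash
#!/usr/bin/env bash
# Sketch of a workflow script combining the population and DQ steps above
set -euo pipefail

cd "$HOME/Projects/database_from_Bitcoin_Core"
source venv/bin/activate

python src/blocks/populate_blocks.py
python src/transactions/populate_transactions.py

python src/blocks/blocks_dq.py
python src/transactions/transactions_dq.py
```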