# Ethereum Blockchain

## Dataset

https://www.kaggle.com/bigquery/ethereum-blockchain

## Description

- ### Context

Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, its set of capabilities has been extended by including a virtual machine that can execute arbitrary code stored on the blockchain, known as **smart contracts**.

Both Bitcoin and Ethereum are essentially OLTP databases and provide little in the way of OLAP (analytics) functionality. However, the Ethereum dataset is notably distinct from the Bitcoin dataset:

1. The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.


2. Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.


3. Addresses are not only wallets that hold balances; they can also contain smart contract bytecode that allows the programmatic creation of agreements and the automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.

- ### Content

The Ethereum blockchain data is available for exploration with BigQuery. All historical data is in the ethereum_blockchain dataset, which updates daily. We use this BigQuery data for our research project; the queries in this notebook read from `bigquery-public-data.crypto_ethereum`.

The dataset consists of 7 tables, which are analysed throughout the project:

1. blocks
2. contracts
3. logs
4. token_transfers
5. tokens
6. traces
7. transactions

<hr/>

# Q1. Average, min, max values of all numeric columns in table *transactions*

In [12]:
from google.cloud import bigquery
import pandas as pd

In [13]:
client = bigquery.Client()

query = """
SELECT *
FROM `bigquery-public-data.crypto_ethereum.transactions`
LIMIT 5;
"""

Using Kaggle's public dataset BigQuery integration.


In [14]:
query_job = client.query(query)

iterator = query_job.result(timeout=30)
rows = list(iterator)

# Transform the rows into a nice pandas dataframe
df = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))

# Look at the first 5
df.head(5)

Unnamed: 0,hash,nonce,transaction_index,from_address,to_address,value,gas,gas_price,input,receipt_cumulative_gas_used,...,receipt_contract_address,receipt_root,receipt_status,block_timestamp,block_number,block_hash,max_fee_per_gas,max_priority_fee_per_gas,transaction_type,receipt_effective_gas_price
0,0x1abb7afdfc3786fbc82ce207ab5f5f0a98972aff4673...,77,0,0x81b69a3ede12a18a5b78c1206635b3857ec19022,0xdef9562610d9a04bc98d3a55d6f414d07222c1c9,1,4000000,41000000000,0x5e29a869000000000000000000000000000000000000...,4000000,...,,0xd0fbe62230c1fd25f202b1f8ab6fca72c39afb824df3...,,2017-02-14 03:39:36+00:00,3179624,0x1235f6fb1f12fbee7627d1e2e9ae5ce9f99da9eed29c...,,,,41000000000
1,0xbafefdfb74670ac8302b3afc1554ccb687eb9b8c4378...,95,0,0x81b69a3ede12a18a5b78c1206635b3857ec19022,0x0a7bcf4b9f16e4d731a7e973636dda396ed961d4,1,4000000,41000000000,0x5e29a869000000000000000000000000000000000000...,4000000,...,,0xb388e246083cb5416a01dc47e34a60b0feac2624a2fe...,,2017-02-14 07:01:26+00:00,3180449,0x5d130b10874fc25f82e14a42875a9b41e8cc7bb7559b...,,,,41000000000
2,0x0bf704126f20520e93f7a90171ad48a9c4136a633b4d...,131,0,0xa77de49902bf9bb46cc932730b54867607953b61,0x2a04cfc0d4838e223505438dcd18d425d952c9b1,0,800000,28000000000,0xa9059cbb000000000000000000000000999cc76e025c...,78272,...,,0xa5cf512182aaea1035efdf3e8be6f6accb76eecb6ceb...,,2017-02-14 17:20:06+00:00,3183105,0xcb7c1d8d602b6eb2f0f1c9c9416b37ed29031aeb4aa6...,,,,28000000000
3,0x6ca4c60a5ca5c94fc685e968d2203caf1f3c73a1967f...,44,7,0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af,0xd51b61da6d774f58df1c933fa1e64750b35c109a,0,800000,20000000000,0x8ee0dab1,1021408,...,,0x30e3fdfc5a557cc5b3476168dcd20096e0855f76ec1c...,,2017-02-14 07:03:25+00:00,3180456,0x86d2cec33ccf6ffce1f1195d4b8f83ce8fb13ceafd7e...,,,,20000000000
4,0xebd1e558092496c02769c2e2cb1b54731657a29249b5...,42,15,0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af,0xd51b61da6d774f58df1c933fa1e64750b35c109a,0,800000,20000000000,0x8ee0dab1,1792942,...,,0x96c558e0cf2985577ad2171b9eae4c52087373d4a979...,,2017-02-14 06:04:51+00:00,3180210,0xfe4fee37ef998364f184529b59db1e9de67705a7af46...,,,,20000000000
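Because the query above uses `LIMIT 5`, every statistic computed from `df` describes five arbitrary rows rather than the whole table. One possible way to get table-wide figures would be to let BigQuery do the aggregation. The sketch below only builds such a query string; the helper name `build_stats_query` is hypothetical, and the query is deliberately not executed here, since it would scan the chosen columns across the full table and may be costly.

```python
def build_stats_query(table, columns):
    """Build a SQL string with AVG/MIN/MAX per column (hypothetical helper)."""
    parts = []
    for col in columns:
        parts.append(f"AVG({col}) AS {col}_avg")
        parts.append(f"MIN({col}) AS {col}_min")
        parts.append(f"MAX({col}) AS {col}_max")
    select_list = ",\n  ".join(parts)
    return f"SELECT\n  {select_list}\nFROM `{table}`"


# Table name taken from the query above; the column choice is illustrative.
stats_query = build_stats_query(
    "bigquery-public-data.crypto_ethereum.transactions",
    ["nonce", "gas", "gas_price"],
)
print(stats_query)
```

The resulting string could then be passed to `client.query(...)` in the same way as the `LIMIT 5` query.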


In [15]:
# Average
print(df["nonce"].mean())
print(df["value"].mean())
print(df["gas"].mean())
print(df["gas_price"].mean())
print(df["receipt_cumulative_gas_used"].mean())

77.8
0.4
2080000.0
30000000000.0
2178524.4


In [17]:
# Min
print(df["nonce"].min())
print(df["value"].min())
print(df["gas"].min())
print(df["gas_price"].min())
print(df["receipt_cumulative_gas_used"].min())

42
0
800000
20000000000
78272


In [18]:
# Max
print(df["nonce"].max())
print(df["value"].max())
print(df["gas"].max())
print(df["gas_price"].max())
print(df["receipt_cumulative_gas_used"].max())

131
1
4000000
41000000000
4000000
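The three cells above can also be collapsed into a single call. The sketch below rebuilds the five sample rows by hand (values copied from the output shown earlier) and assumes the columns carry numeric dtypes; in the real `df`, a column such as `value` may arrive as a non-numeric dtype and would then need converting first.

```python
import pandas as pd

# Sample values copied from the five rows displayed earlier.
sample = pd.DataFrame({
    "nonce": [77, 95, 131, 44, 42],
    "value": [1, 1, 0, 0, 0],
    "gas": [4000000, 4000000, 800000, 800000, 800000],
    "gas_price": [41000000000, 41000000000, 28000000000,
                  20000000000, 20000000000],
    "receipt_cumulative_gas_used": [4000000, 4000000, 78272,
                                    1021408, 1792942],
})

# Keep only numeric columns, then compute all three statistics at once.
numeric = sample.select_dtypes(include="number")
stats = numeric.agg(["mean", "min", "max"])
print(stats)
```

The result has one row per statistic and one column per numeric field, which makes the figures easier to compare than separate `print` calls.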


# Q2. Rows matching the average, min, and max values of the numeric columns in table *transactions*

In [20]:
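# Average: an exact match is unlikely, since a column's mean rarely equals any single value.
# Only the last expression in a cell is displayed, so the (empty) output below is for receipt_cumulative_gas_used.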
df[df["nonce"] == df["nonce"].mean()]
df[df["value"] == df["value"].mean()]
df[df["gas"] == df["gas"].mean()]
df[df["gas_price"] == df["gas_price"].mean()]
df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].mean()]

Unnamed: 0,hash,nonce,transaction_index,from_address,to_address,value,gas,gas_price,input,receipt_cumulative_gas_used,...,receipt_contract_address,receipt_root,receipt_status,block_timestamp,block_number,block_hash,max_fee_per_gas,max_priority_fee_per_gas,transaction_type,receipt_effective_gas_price


In [21]:
# Min (only the last expression is displayed: the row with the minimum receipt_cumulative_gas_used)
df[df["nonce"] == df["nonce"].min()]
df[df["value"] == df["value"].min()]
df[df["gas"] == df["gas"].min()]
df[df["gas_price"] == df["gas_price"].min()]
df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].min()]

Unnamed: 0,hash,nonce,transaction_index,from_address,to_address,value,gas,gas_price,input,receipt_cumulative_gas_used,...,receipt_contract_address,receipt_root,receipt_status,block_timestamp,block_number,block_hash,max_fee_per_gas,max_priority_fee_per_gas,transaction_type,receipt_effective_gas_price
2,0x0bf704126f20520e93f7a90171ad48a9c4136a633b4d...,131,0,0xa77de49902bf9bb46cc932730b54867607953b61,0x2a04cfc0d4838e223505438dcd18d425d952c9b1,0,800000,28000000000,0xa9059cbb000000000000000000000000999cc76e025c...,78272,...,,0xa5cf512182aaea1035efdf3e8be6f6accb76eecb6ceb...,,2017-02-14 17:20:06+00:00,3183105,0xcb7c1d8d602b6eb2f0f1c9c9416b37ed29031aeb4aa6...,,,,28000000000


In [22]:
# Max (only the last expression is displayed: the rows with the maximum receipt_cumulative_gas_used)
df[df["nonce"] == df["nonce"].max()]
df[df["value"] == df["value"].max()]
df[df["gas"] == df["gas"].max()]
df[df["gas_price"] == df["gas_price"].max()]
df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].max()]

Unnamed: 0,hash,nonce,transaction_index,from_address,to_address,value,gas,gas_price,input,receipt_cumulative_gas_used,...,receipt_contract_address,receipt_root,receipt_status,block_timestamp,block_number,block_hash,max_fee_per_gas,max_priority_fee_per_gas,transaction_type,receipt_effective_gas_price
0,0x1abb7afdfc3786fbc82ce207ab5f5f0a98972aff4673...,77,0,0x81b69a3ede12a18a5b78c1206635b3857ec19022,0xdef9562610d9a04bc98d3a55d6f414d07222c1c9,1,4000000,41000000000,0x5e29a869000000000000000000000000000000000000...,4000000,...,,0xd0fbe62230c1fd25f202b1f8ab6fca72c39afb824df3...,,2017-02-14 03:39:36+00:00,3179624,0x1235f6fb1f12fbee7627d1e2e9ae5ce9f99da9eed29c...,,,,41000000000
1,0xbafefdfb74670ac8302b3afc1554ccb687eb9b8c4378...,95,0,0x81b69a3ede12a18a5b78c1206635b3857ec19022,0x0a7bcf4b9f16e4d731a7e973636dda396ed961d4,1,4000000,41000000000,0x5e29a869000000000000000000000000000000000000...,4000000,...,,0xb388e246083cb5416a01dc47e34a60b0feac2624a2fe...,,2017-02-14 07:01:26+00:00,3180449,0x5d130b10874fc25f82e14a42875a9b41e8cc7bb7559b...,,,,41000000000
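Testing for equality with the mean returns no rows, as the empty "Average" output shows. A possible alternative is to pick the row closest to the mean, and to keep the equality filter only for min and max, where ties are meaningful. This is a minimal sketch on hand-copied sample values, not the notebook's own method.

```python
import pandas as pd

# Two columns copied from the five rows displayed earlier.
sample = pd.DataFrame({
    "nonce": [77, 95, 131, 44, 42],
    "gas": [4000000, 4000000, 800000, 800000, 800000],
})

col = "nonce"
# Row whose value lies closest to the column mean.
distance = (sample[col] - sample[col].mean()).abs()
closest_to_mean = sample.loc[[distance.idxmin()]]

# All rows tied for the minimum and the maximum of another column.
min_rows = sample[sample["gas"] == sample["gas"].min()]
max_rows = sample[sample["gas"] == sample["gas"].max()]

print(closest_to_mean)
print(min_rows)
print(max_rows)
```

Note that `idxmin` returns only the first match, whereas the boolean filter keeps every tied row.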


# Q3. Find number of missing values of each column

In [23]:
# Count missing values in each of the five numeric columns
for col in ["nonce", "value", "gas", "gas_price", "receipt_cumulative_gas_used"]:
    missing = df[col].isnull().sum()
    print(missing)

0
0
0
0
0
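The cell above checks five columns only. To cover every column, `isnull().sum()` can be applied to the whole DataFrame. The sketch below uses a hand-built sample: the two numeric columns are copied from the rows shown earlier, while `receipt_status` and `max_fee_per_gas` are filled with `None` on the assumption that their empty cells in the displayed output are missing values.

```python
import pandas as pd

sample = pd.DataFrame({
    "nonce": [77, 95, 131, 44, 42],
    "gas": [4000000, 4000000, 800000, 800000, 800000],
    # Assumed missing: these cells appear empty in the displayed rows.
    "receipt_status": [None] * 5,
    "max_fee_per_gas": [None] * 5,
})

# One missing-value count per column, returned as a Series.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```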


# Q4. Find the unique values of each column

In [24]:
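# Only the last expression in a cell is displayed, so the output below is for receipt_cumulative_gas_used.
pd.unique(df["nonce"])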
pd.unique(df["value"])
pd.unique(df["gas"])
pd.unique(df["gas_price"])
pd.unique(df["receipt_cumulative_gas_used"])

array([4000000,   78272, 1021408, 1792942])
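To see the unique values of every column rather than only the last one, the results can be collected in a dictionary. This is a minimal sketch on sample values copied from the rows shown earlier; the names `unique_values` and `distinct_counts` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "nonce": [77, 95, 131, 44, 42],
    "gas": [4000000, 4000000, 800000, 800000, 800000],
    "receipt_cumulative_gas_used": [4000000, 4000000, 78272,
                                    1021408, 1792942],
})

# Unique values per column, in order of first appearance.
unique_values = {col: sample[col].unique().tolist() for col in sample.columns}

# Number of distinct values per column.
distinct_counts = sample.nunique()

for col, values in unique_values.items():
    print(col, values)
print(distinct_counts)
```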

# Q5. Find subset of rows based on some condition

In [30]:
gas_used_above_100000 = df[df["receipt_cumulative_gas_used"] > 100000]
print(gas_used_above_100000)

                                                hash  nonce  \
0  0x1abb7afdfc3786fbc82ce207ab5f5f0a98972aff4673...     77   
1  0xbafefdfb74670ac8302b3afc1554ccb687eb9b8c4378...     95   
3  0x6ca4c60a5ca5c94fc685e968d2203caf1f3c73a1967f...     44   
4  0xebd1e558092496c02769c2e2cb1b54731657a29249b5...     42   

   transaction_index                                from_address  \
0                  0  0x81b69a3ede12a18a5b78c1206635b3857ec19022   
1                  0  0x81b69a3ede12a18a5b78c1206635b3857ec19022   
3                  7  0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af   
4                 15  0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af   

                                   to_address value      gas    gas_price  \
0  0xdef9562610d9a04bc98d3a55d6f414d07222c1c9     1  4000000  41000000000   
1  0x0a7bcf4b9f16e4d731a7e973636dda396ed961d4     1  4000000  41000000000   
3  0xd51b61da6d774f58df1c933fa1e64750b35c109a     0   800000  20000000000   
4  0xd51b61da6d774f58df1c933fa1e647
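Row filters can also combine several conditions. The sketch below, on sample values copied from the rows shown earlier, keeps rows with cumulative gas above 100000 and a zero `value`, once with a boolean mask and once with `DataFrame.query`; the second condition is chosen purely for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "value": [1, 1, 0, 0, 0],
    "receipt_cumulative_gas_used": [4000000, 4000000, 78272,
                                    1021408, 1792942],
})

# Each condition needs its own parentheses when combined with & or |.
mask = (sample["receipt_cumulative_gas_used"] > 100000) & (sample["value"] == 0)
subset = sample[mask]

# The same filter written as a query string.
same_subset = sample.query("receipt_cumulative_gas_used > 100000 and value == 0")

print(subset)
```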

# Q6. Find subset of columns based on some condition

In [31]:
# This comparison yields a boolean mask (one True/False per row), not a subset of columns.
gas_used_above_100000 = df["receipt_cumulative_gas_used"] > 100000
print(gas_used_above_100000)

0     True
1     True
2    False
3     True
4     True
Name: receipt_cumulative_gas_used, dtype: bool
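The output above is a row-wise boolean mask. Selecting a subset of columns would instead apply a condition per column. The sketch below shows two possible conditions on hand-copied sample values: keeping numeric columns only, and keeping columns whose maximum exceeds a threshold (the threshold of 1,000,000 is an arbitrary illustration).

```python
import pandas as pd

sample = pd.DataFrame({
    "from_address": [
        "0x81b69a3ede12a18a5b78c1206635b3857ec19022",
        "0x81b69a3ede12a18a5b78c1206635b3857ec19022",
        "0xa77de49902bf9bb46cc932730b54867607953b61",
        "0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af",
        "0x4ac5ed4b31f4f0e16bd96bf2d9f96200d7c266af",
    ],
    "nonce": [77, 95, 131, 44, 42],
    "value": [1, 1, 0, 0, 0],
    "gas": [4000000, 4000000, 800000, 800000, 800000],
    "receipt_cumulative_gas_used": [4000000, 4000000, 78272,
                                    1021408, 1792942],
})

# Condition on dtype: keep numeric columns only.
numeric_cols = sample.select_dtypes(include="number")

# Condition on content: keep columns whose maximum exceeds the threshold.
wide_range = numeric_cols.loc[:, numeric_cols.max() > 1_000_000]

print(list(numeric_cols.columns))
print(list(wide_range.columns))
```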
