# Ethereum Blockchain

## Dataset

https://www.kaggle.com/bigquery/ethereum-blockchain

## Description

- ### Context

Bitcoin and other cryptocurrencies have captured the imagination of technologists, financiers, and economists. Digital currencies are only one application of the underlying blockchain technology. Like its predecessor, Bitcoin, the Ethereum blockchain can be described as an immutable distributed ledger. However, its capabilities have been extended by a virtual machine that can execute arbitrary code stored on the blockchain, known as **Smart Contracts**.

Both Bitcoin and Ethereum are essentially OLTP databases and provide little in the way of OLAP (analytics) functionality. However, the Ethereum dataset is notably distinct from the Bitcoin dataset:

1. The Ethereum blockchain has as its primary unit of value Ether, while the Bitcoin blockchain has Bitcoin. However, the majority of value transfer on the Ethereum blockchain is composed of so-called tokens. Tokens are created and managed by smart contracts.


2. Ether value transfers are precise and direct, resembling accounting ledger debits and credits. This is in contrast to the Bitcoin value transfer mechanism, for which it can be difficult to determine the balance of a given wallet address.


3. Addresses are not only wallets that hold balances; they can also contain smart contract bytecode that allows the programmatic creation of agreements and automatic triggering of their execution. An aggregate of coordinated smart contracts could be used to build a decentralized autonomous organization.
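
To make the ledger analogy in point 2 concrete, here is a minimal sketch of account-style bookkeeping: each transfer debits the sender and credits the receiver, so a balance is a running sum. The addresses and amounts are made up for illustration, and the sketch ignores gas fees, block rewards, and starting balances, so it is not a way to compute real Ether balances.

```python
from collections import defaultdict

# Hypothetical transfers as (sender, receiver, amount); toy data only.
transfers = [
    ("0xaaa", "0xbbb", 50),
    ("0xbbb", "0xccc", 20),
    ("0xaaa", "0xccc", 5),
]


def ledger_balances(transfers):
    """Return the net change per address: debit senders, credit receivers."""
    balances = defaultdict(int)
    for sender, receiver, amount in transfers:
        balances[sender] -= amount    # debit
        balances[receiver] += amount  # credit
    return dict(balances)


balances = ledger_balances(transfers)
print(balances)
```

Because every debit has a matching credit, the net changes sum to zero. A negative figure here only reflects that the toy data has no opening balances.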

- ### Content

The Ethereum blockchain data is available for exploration with BigQuery. All historical data is in the ethereum_blockchain dataset, which updates daily. We have leveraged this BigQuery dataset for our research project.

The dataset consists of seven tables, which are analysed throughout the project:

1. blocks
2. contracts
3. logs
4. token_transfers
5. tokens
6. traces
7. transactions

<hr/>

# Q1. Average, min, max values of all numeric columns in table *transactions*

In [None]:
!pip install google-cloud-bigquery

In [None]:
from google.cloud import bigquery
import pandas as pd

In [None]:
# Jupyter
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "./dav-proj.json"
client = bigquery.Client()

# Google Colab
# from google.colab import auth
# auth.authenticate_user()
# print('Authenticated')
# client = bigquery.Client(project="elite-emitter-321914")

query = """
SELECT *
FROM `bigquery-public-data.crypto_ethereum.transactions`
LIMIT 5;
"""

In [None]:
query_job = client.query(query)

iterator = query_job.result(timeout=30)
rows = list(iterator)

# Transform the rows into a nice pandas dataframe
df = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))

# Look at the first 5
df.head(5)

In [None]:
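# Average (print each result; a notebook cell only displays its last expression)
print(df["nonce"].mean())
print(df["value"].mean())
print(df["gas"].mean())
print(df["gas_price"].mean())
print(df["receipt_cumulative_gas_used"].mean())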

In [None]:
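# Min (print each result; a notebook cell only displays its last expression)
print(df["nonce"].min())
print(df["value"].min())
print(df["gas"].min())
print(df["gas_price"].min())
print(df["receipt_cumulative_gas_used"].min())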

In [None]:
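# Max (print each result; a notebook cell only displays its last expression)
print(df["nonce"].max())
print(df["value"].max())
print(df["gas"].max())
print(df["gas_price"].max())
print(df["receipt_cumulative_gas_used"].max())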
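
The three statistics can also be collected into a single summary table. The sketch below uses a small made-up DataFrame named `sample` in place of the query result (the values are illustrative, not real transactions) and assumes the columns are already numeric. BigQuery `NUMERIC` columns may arrive as Python `Decimal` objects, in which case a conversion such as `astype(float)` might be needed first.

```python
import pandas as pd

# Toy stand-in for the query result; values are illustrative only.
sample = pd.DataFrame(
    {
        "nonce": [1, 2, 3, 4],
        "gas": [21000, 50000, 21000, 90000],
        "gas_price": [10, 20, 30, 40],
    }
)

# One row per statistic, one column per numeric field.
summary = sample.agg(["mean", "min", "max"])
print(summary)
```

A single value can then be read with, for example, `summary.loc["mean", "gas"]`.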

# Q2. Rows holding the average, min, and max value of each numeric column in table *transactions*

In [None]:
# Average: rows whose value equals the column mean exactly (often none)
print(df[df["nonce"] == df["nonce"].mean()])
print(df[df["value"] == df["value"].mean()])
print(df[df["gas"] == df["gas"].mean()])
print(df[df["gas_price"] == df["gas_price"].mean()])
print(df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].mean()])

In [None]:
# Min: rows that hold the column minimum
print(df[df["nonce"] == df["nonce"].min()])
print(df[df["value"] == df["value"].min()])
print(df[df["gas"] == df["gas"].min()])
print(df[df["gas_price"] == df["gas_price"].min()])
print(df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].min()])

In [None]:
# Max: rows that hold the column maximum
print(df[df["nonce"] == df["nonce"].max()])
print(df[df["value"] == df["value"].max()])
print(df[df["gas"] == df["gas"].max()])
print(df[df["gas_price"] == df["gas_price"].max()])
print(df[df["receipt_cumulative_gas_used"] == df["receipt_cumulative_gas_used"].max()])

# Q3. Find number of missing values of each column

In [None]:
# Count missing values in every column at once
missing = df.isnull().sum()
print(missing)

# Q4. Find the unique values of each column

In [None]:
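# Print each result; a notebook cell only displays its last expression
print(pd.unique(df["nonce"]))
print(pd.unique(df["value"]))
print(pd.unique(df["gas"]))
print(pd.unique(df["gas_price"]))
print(pd.unique(df["receipt_cumulative_gas_used"]))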

# Q5. Find subset of rows based on some condition

In [None]:
gas_used_above_100000 = df[df["receipt_cumulative_gas_used"] > 100000]
gas_used_above_100000

# Q6. Find subset of columns based on some condition

In [None]:
# Keep only the columns that satisfy a condition (here: numeric dtype)
numeric_columns = df.select_dtypes(include="number")
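
Q5 and Q6 differ only in which axis the condition applies to. The sketch below shows both on a small made-up DataFrame named `sample` (the hashes and gas figures are illustrative, not real transactions). The numeric-dtype condition for columns is just one possible choice.

```python
import pandas as pd

# Toy stand-in for the query result; values are illustrative only.
sample = pd.DataFrame(
    {
        "hash": ["0x01", "0x02", "0x03"],
        "gas": [21000, 50000, 90000],
        "receipt_cumulative_gas_used": [80000, 150000, 300000],
    }
)

# Row subset: a boolean Series indexed by row keeps rows where it is True.
row_mask = sample["receipt_cumulative_gas_used"] > 100000
rows_above = sample[row_mask]

# Column subset: a boolean Series indexed by column name, used with .loc,
# keeps columns where it is True (here: columns with a numeric dtype).
col_mask = sample.dtypes.apply(pd.api.types.is_numeric_dtype)
numeric_only = sample.loc[:, col_mask]

print(rows_above)
print(numeric_only)
```

A comparison such as `sample["receipt_cumulative_gas_used"] > 100000` on its own yields only the boolean mask. The mask has to be passed back into `sample[...]` or `sample.loc[...]` to obtain a subset.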