# Ethereum transactions analysis

## Read the eth_transactions parquet files

In [1]:
import os
import json

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
PATH = '../data/'

In [3]:
pd.set_option('display.max_columns', 100, 'display.max_rows', 100, 'display.max_colwidth', 100)
# pd.set_option('display.float_format', lambda x: '%.f' % x)

In [4]:
file_dir = os.listdir(PATH)
file_list = [os.path.join(PATH, file) for file in file_dir if file.endswith('parquet')]

In [5]:
df_raw = pd.concat([pd.read_parquet(file) for file in file_list])

In [6]:
df_raw.shape

(5493838, 20)
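By default `pd.concat` keeps each input frame's own row index, so row labels can repeat across files. A minimal sketch of the difference, using two made-up frames (`part_a` and `part_b` are hypothetical stand-ins for the parquet files, not the real data):

```python
import pandas as pd

# Two tiny stand-ins for per-file frames; each has its own index starting at 0.
part_a = pd.DataFrame({"blockNumber": [1, 1]})
part_b = pd.DataFrame({"blockNumber": [2, 2, 2]})

# Default behaviour: the original row labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([part_a, part_b])

# ignore_index=True discards the per-file labels and renumbers rows 0..n-1.
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(stacked.index.is_unique)
print(renumbered.index.is_unique)
```

Repeated labels do not affect the counts and `describe()` calls in this notebook, but a unique index avoids surprises with label-based selection such as `.loc`.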

### Convert `type` to categorical data type

In [7]:
df_raw['type'] = df_raw['type'].astype('category')

In [8]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5493838 entries, 0 to 2791715
Data columns (total 20 columns):
 #   Column                Dtype         
---  ------                -----         
 0   hash                  object        
 1   blockHash             object        
 2   blockNumber           int64         
 3   from                  object        
 4   gas                   int64         
 5   gasPrice              int64         
 6   input                 object        
 7   nonce                 int64         
 8   r                     object        
 9   s                     object        
 10  to                    object        
 11  transactionIndex      int64         
 12  type                  category      
 13  v                     int64         
 14  value                 float64       
 15  accessList            object        
 16  chainId               float64       
 17  maxFeePerGas          float64       
 18  maxPriorityFeePerGas  float64       
 19  

## Data fields
- `hash` - Hash of the transaction
- `blockHash` - Hash of the block
- `blockNumber` - Block number
- `from` - Address of the sender
- `gas` - Gas provided by the sender
- `gasPrice` - Gas price provided by the sender in Wei
- `input` - The data sent along with the transaction. Commonly used as part of contract interaction or as a message sent to the recipient. See 
- `nonce` - The number of transactions made by the sender prior to this one
- `r` - The Elliptic Curve Digital Signature Algorithm (ECDSA) signature r. The standardised R field of the signature. See: https://openethereum.github.io/JSONRPC
- `s` - The Elliptic Curve Digital Signature Algorithm (ECDSA) signature s
- `to` - Address of the receiver
- `transactionIndex` - Integer of the transaction's index position in the block
- `type` - Transaction type. Not found in the Alchemy docs, but the values match `transaction_type` in the BigQuery public dataset. Under the EIP-2718 numbering, 0 is a legacy transaction, 1 is an EIP-2930 transaction and 2 is an EIP-1559 transaction
- `v` - The Elliptic Curve Digital Signature Algorithm (ECDSA) recovery id. The standardised V field of the signature. See: https://openethereum.github.io/JSONRPC
- `value` - The amount of ether transferred in Wei.
- `accessList` - Contains addresses and storage keys that the transaction plans to access. Introduced by EIP-2930 and absent from legacy transactions. See: https://openethereum.github.io/JSONRPC
- `chainId` - Value used in replay-protected transaction signing as introduced by EIP-155
- `maxFeePerGas` - The maximum fee per gas the transaction sender is willing to pay total (introduced by EIP1559). For detailed explanation, refer to https://docs.alchemy.com/alchemy/guides/eip-1559/maxpriorityfeepergas-vs-maxfeepergas
- `maxPriorityFeePerGas` - The maximum priority fee (tip) per gas the transaction sender is willing to pay on top of the base fee (introduced by EIP-1559). Refer to https://docs.alchemy.com/alchemy/guides/eip-1559/maxpriorityfeepergas-vs-maxfeepergas
- `block_timestamp` - Timestamp of the block
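Several of the fields above are denominated in Wei, which is hard to read at a glance. The sketch below converts Wei to Gwei and Ether using the standard unit ratios (1 Gwei = 10^9 Wei, 1 Ether = 10^18 Wei). The helper names `wei_to_gwei` and `wei_to_ether` are illustrative, and the two input amounts are simply example values.

```python
from decimal import Decimal

WEI_PER_GWEI = 10**9
WEI_PER_ETHER = 10**18


def wei_to_gwei(wei: int) -> Decimal:
    """Convert an integer amount of Wei to Gwei without float rounding."""
    return Decimal(wei) / Decimal(WEI_PER_GWEI)


def wei_to_ether(wei: int) -> Decimal:
    """Convert an integer amount of Wei to Ether without float rounding."""
    return Decimal(wei) / Decimal(WEI_PER_ETHER)


# Example amounts: a gas price and a transferred value, both in Wei.
gas_price_gwei = wei_to_gwei(33_084_805_662)
value_ether = wei_to_ether(1_614_273_837_860_304)

print(gas_price_gwei)
print(value_ether)
```

`Decimal` is used here instead of `float` so that the conversion stays exact for the amounts shown.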

References:  
https://docs.dune.com/data-tables/data-tables/raw-data/ethereum-data#ethereum.transactions  
The "bigquery-public-data.crypto_ethereum.transactions" column description  
https://ethereum.org/en/developers/docs/apis/json-rpc/  
https://docs.alchemy.com/alchemy/apis/ethereum/eth-gettransactionbyhash  

## We have the following entities in our domain:
- Blocks
- Transaction
- Value and Fees
- Account

The process/relationship:
- A block contains multiple transactions
- A transaction request is a request for computation on the Ethereum Virtual Machine (EVM). A transaction is a fulfilled transaction request and the associated change in the EVM state.  
When a request is broadcast, other participants on the network verify, validate and carry out the computation.  
A transaction is an action initiated by an externally owned account, that is, an account managed by a human rather than by a contract (a smart contract, i.e. controlled by code).  
A transaction requires a fee and must be mined to be valid.
- An account can be externally-owned or a contract (smart contract)

References:
https://ethereum.org/en/developers/docs/intro-to-ethereum/

## We have the following information about our entities:
- Blocks: `blockHash`, `blockNumber`, `block_timestamp`
- Transaction: `hash`, `from`, `to`, `transactionIndex`, `type`, `input`, `chainId`
- Value and Fees: `value`, `gas`, `gasPrice`, `maxFeePerGas`, `maxPriorityFeePerGas`
- Account: `nonce`

In [49]:
df_raw.head(3)

Unnamed: 0,hash,blockHash,blockNumber,from,gas,gasPrice,input,nonce,r,s,to,transactionIndex,type,v,value,accessList,chainId,maxFeePerGas,maxPriorityFeePerGas,block_timestamp
0,0xfedacdb532ec5525686557a9bca04daa357ba754ccd70bea7cf0459572c189cf,0x5502093582eef8f1c9088f8db764ab59f403787954dee81eeaf681bf8ba681a7,15064948,0xf71e4a144cda4498277f9ad89b6501ec6c83c27c,21000,33084805662,0x,535,0xdc2fccbf51c8dde791b9ec027d4d53aab9bf6d5c10f963135a1bff925203dba7,0x1c0b133e837b36281fd6bdb9abb907c2060db9f6ab1824205037f3069458cafe,0xf0fb796c4ea2f2b24939ddb20d4b66144b3ed4bc,0,0,27,1614273837860304,,,,,2022-07-02 19:17:48
1,0xf40f05907bfce0363bc128055d6a1270ea2469f072166c69e3358da754a0c1ba,0x5502093582eef8f1c9088f8db764ab59f403787954dee81eeaf681bf8ba681a7,15064948,0xf0fb796c4ea2f2b24939ddb20d4b66144b3ed4bc,48792,33084805662,0xa9059cbb00000000000000000000000013e68bdecfe11ae8c314307516e8b027057ddac00000000000000000000000...,181,0x3f6a162d208194c1189ef6fbc0120a5a01cba85a0a1ff06f227b6fb645832435,0x7dcd65f3728f916430683013eabea52dbae65666e134e776a4c789cd1ff53161,0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48,1,0,27,0,,,,,2022-07-02 19:17:48
2,0x9f7e7d4bed78c97f3018113e620141073b6121e6cb224f4183997648e180d341,0x5502093582eef8f1c9088f8db764ab59f403787954dee81eeaf681bf8ba681a7,15064948,0x65a8f07bd9a8598e1b5b6c0a88f4779dbc077675,601375,23648988340,0x7c0252000000000000000000000000002057cfb9fd11837d61b294d514c5bd03e5e7189a0000000000000000000000...,25331,0x344ce402827c5d35864b9dd52e6211def0eacba544dd7b47d25c8f1867559b73,0x97749f02882a8740bd3b7e6f5e0abdb9e8844542fd32b3ea2d2a05ee2890063,0x1111111254fb6c44bac0bed2854e76f90643097d,2,2,0,0,[],1.0,250000000000.0,13000000000.0,2022-07-02 19:17:48
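The `input` values in these rows hint at what each transaction does. An `input` of `0x` carries no call data (a plain ether transfer), while contract calls start with a 4-byte function selector. For example, `0xa9059cbb` is the selector of the ERC-20 `transfer(address,uint256)` function. A minimal sketch for pulling the selector out of an `input` string follows; `function_selector` and `KNOWN_SELECTORS` are hypothetical names, and the lookup table deliberately holds only this one entry.

```python
from typing import Optional

# Tiny illustrative lookup table; a real analysis would need a fuller mapping.
KNOWN_SELECTORS = {"0xa9059cbb": "transfer(address,uint256)"}


def function_selector(input_hex: str) -> Optional[str]:
    """Return the first 4 bytes of the call data as '0x' + 8 hex characters.

    Returns None when the input is empty ('0x') or shorter than 4 bytes.
    """
    data = input_hex[2:] if input_hex.startswith("0x") else input_hex
    if len(data) < 8:
        return None
    return "0x" + data[:8].lower()


plain_transfer = function_selector("0x")
token_call = function_selector("0xa9059cbb" + "00" * 32)  # padded example payload

print(plain_transfer)
print(token_call, KNOWN_SELECTORS.get(token_call))
```

Because only the prefix is read, the helper also works on `input` strings that are truncated for display.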


How many blocks are there in our dataset?

In [46]:
df_raw['blockNumber'].nunique()

29029

How many blocks are there each day in our dataset?

In [8]:
df_raw.resample('D', on='block_timestamp').agg({'blockNumber':'nunique'})

block_timestamp,blockNumber
2022-06-29,3552
2022-06-30,5475
2022-07-01,6253
2022-07-02,6227
2022-07-03,6315
2022-07-04,1207


How many transactions?

In [51]:
df_raw.shape[0]

5493838

How many senders?

In [50]:
df_raw['from'].nunique()

1261792

How many recipients?

In [9]:
df_raw['to'].nunique()

966315

How many days of data?

In [53]:
df_raw.resample('D', on='block_timestamp').size()

block_timestamp
2022-06-29     698486
2022-06-30    1054433
2022-07-01    1178620
2022-07-02    1189382
2022-07-03    1157052
2022-07-04     215865
Freq: D, dtype: int64

What are the minimum and maximum `gas`?

In [54]:
df_raw['gas'].describe()

count    5493838
mean      165523
std       381351
min        21000
25%        27938
50%        90000
75%       207128
max     30029179
Name: gas, dtype: float64

What are the minimum and maximum `gasPrice`?

In [59]:
df_raw['gasPrice'].describe()

count           5493838
mean        37222051738
std        114347052967
min          3717513949
25%         14562386787
50%         24496026516
75%         43000000000
max     150000000000000
Name: gasPrice, dtype: float64

In [8]:
df_raw['nonce'].describe()

count    5493838
mean     1369095
std      5874673
min            0
25%           16
50%          200
75%        35888
max     43702751
Name: nonce, dtype: float64

In [12]:
df_raw['transactionIndex'].describe()

count   5493838
mean        139
std         115
min           0
25%          50
50%         113
75%         204
max        1307
Name: transactionIndex, dtype: float64

In [13]:
df_raw['type'].value_counts(dropna=False)

2    4623591
0     858767
1      11480
Name: type, dtype: int64

In [24]:
df_raw['type'].value_counts(dropna=False, normalize=True)

2    0.841596
0    0.156315
1    0.002090
Name: type, dtype: float64

In [18]:
df_raw['type'].value_counts().sum()

5493838
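To make the type counts easier to read, the sketch below recomputes the shares as percentages and attaches a label to each code. The labels assume the `type` column follows the EIP-2718 numbering (0 legacy, 1 EIP-2930, 2 EIP-1559); the counts are copied from the `value_counts()` output above.

```python
# Labels are an assumption: they apply only if `type` follows EIP-2718 numbering.
TYPE_LABELS = {
    0: "legacy",
    1: "EIP-2930 (access list)",
    2: "EIP-1559 (dynamic fee)",
}

# Counts copied from the value_counts() output.
counts = {2: 4_623_591, 0: 858_767, 1: 11_480}

total = sum(counts.values())
shares_pct = {t: round(100 * n / total, 2) for t, n in counts.items()}

for tx_type, pct in shares_pct.items():
    print(f"type {tx_type} ({TYPE_LABELS[tx_type]}): {pct}%")
```

Note that a normalized share of 0.002090 corresponds to roughly 0.21%, not 0.002%.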

In [32]:
# pd.set_option('display.float_format', lambda x: '%.f' % x)

In [29]:
df_raw['value'].describe()

count                    5493838
mean         1935085034425374720
std        244890694547030605824
min                            0
25%                            0
50%                            0
75%            50000000000000000
max     249999999999999995805696
Name: value, dtype: float64
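Because `value` is stored as `float64`, amounts above 2^53 Wei (about 0.009 ether) cannot all be represented exactly. The maximum above, ending in `...995805696`, is what a round 250,000 ether would look like after rounding to `float64`; whether the underlying transaction was exactly that amount is an assumption. A small sketch of the effect:

```python
# A round amount: 250,000 ether expressed in Wei as an exact Python int.
exact_wei = 250_000 * 10**18

# Converting to float64 rounds to the nearest representable value.
round_trip = int(float(exact_wei))

print(exact_wei)
print(round_trip)
print(exact_wei - round_trip)  # Wei lost to rounding

# The same effect at the boundary: 2**53 + 1 is not representable as float64.
print(int(float(2**53 + 1)) == 2**53 + 1)
```

If exact Wei amounts matter, one option is to keep `value` as a string or Python `int` (object dtype) when loading the data, at the cost of slower numeric operations.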

In [38]:
df_raw['chainId'].value_counts(dropna=False, normalize=True)

1.0    0.843685
NaN    0.156315
Name: chainId, dtype: float64

In [44]:
# pd.set_option('display.float_format', lambda x: '%.f' % x)

In [43]:
df_raw['maxFeePerGas'].describe()

count          4623591
mean       79362970949
std       195638185747
min         3717513949
25%        22474826048
50%        39639152684
75%        76912194134
max     50456218000000
Name: maxFeePerGas, dtype: float64

In [45]:
df_raw['maxPriorityFeePerGas'].describe()

count          4623591
mean        7180894508
std        91090595491
min                  0
25%         1500000000
50%         2000000000
75%         2500000000
max     49398545703515
Name: maxPriorityFeePerGas, dtype: float64
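For EIP-1559 transactions, the price actually paid per gas is the block's base fee plus a tip, where the tip is capped by both `maxPriorityFeePerGas` and the headroom left under `maxFeePerGas`. The sketch below implements that rule. The block base fee is not part of this dataset, so the `base_fee` used in the example is a hypothetical value, and `effective_gas_price` is an illustrative helper name.

```python
def effective_gas_price(base_fee: int, max_fee: int, max_priority_fee: int) -> int:
    """Price per gas paid by an EIP-1559 transaction, in Wei.

    base_fee is the block's base fee per gas (not available in this dataset).
    """
    if max_fee < base_fee:
        raise ValueError("maxFeePerGas below base fee: transaction cannot be included")
    # The tip is limited by the sender's cap and by what is left under max_fee.
    tip = min(max_priority_fee, max_fee - base_fee)
    return base_fee + tip


# Hypothetical base fee, chosen purely for illustration.
example_price = effective_gas_price(
    base_fee=10_648_988_340,
    max_fee=250_000_000_000,
    max_priority_fee=13_000_000_000,
)
print(example_price)
```

When `maxFeePerGas` is far above the base fee, as in this example, the sender pays the base fee plus the full tip rather than the stated maximum.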

## Overall figures
In this analysis, we will not look into:  
`r`, `s` and `v`, which are the components of the ECDSA signature.  
`accessList`, which contains addresses and storage keys introduced by EIP-2930 (it is not a legacy-transaction field). The field would also need to be flattened.  


- How many blocks are there in our dataset?  
29,029
- How many transactions?  
5,493,838
- How many senders?  
1,261,792
- How many recipients?  
966,315
- What is the min, max, median and average `gas`?  
min:21,000; max:30,029,179; median:90,000; mean:165,523
- What is the min, max, median and average `gasPrice`?    
min:3,717,513,949; max:150,000,000,000,000; median:24,496,026,516; mean:37,222,051,738
- What is the min, max, median and average `nonce`?  
min:0; max:43,702,751; median:200; mean:1,369,095
- What is the min, max, median and average `transactionIndex`?  
min:0; max:1,307; median:113; mean:139
- What are the categories in `type`?  
84% of transactions are 2, 16% are 0 and 0.2% are 1
- What are the categories in `chainId`?  
84% are 1 and 16% are missing the `chainId`
- What is the min, max, median and average `value`?  
min:0; max:249,999,999,999,999,995,805,696; median:0; mean:1,935,085,034,425,374,720
- Which dates were these transactions executed?  
2022-06-29 to 2022-07-04

## Key Findings
- There are more senders than recipients. Why?
- 84% of transactions have a `type` of 2 and 16% have a `type` of 0. The share of type 0 transactions (0.156315) is identical to the share of rows with a missing `chainId`. What is the relationship?
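
One way to probe the `type` / `chainId` question is to compute the share of missing `chainId` values within each transaction type. The sketch below does this on a made-up frame (`toy` is a hypothetical stand-in for `df_raw`, constructed so that only type 0 rows lack a `chainId`); it shows the technique and is not a result from the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample: in this toy frame only type 0 rows have a missing chainId.
toy = pd.DataFrame({
    "type": [0, 0, 1, 2, 2, 2],
    "chainId": [np.nan, np.nan, 1.0, 1.0, 1.0, 1.0],
})

# Fraction of rows with a missing chainId, computed separately per type.
missing_share = toy["chainId"].isna().groupby(toy["type"]).mean()

print(missing_share)
```

Running the same two lines against `df_raw` would show whether the missing `chainId` values are concentrated in a single transaction type.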