# Transaction Data Collection from the Ethereum Blockchain

## EIP-1559
- Date: August 5, 2021
- Block number: 12,965,000
- [Ethereum JSON-RPC Specification](https://ethereum.github.io/execution-apis/api-documentation/)
- [JSON RPC API](https://ethereum.org/en/developers/docs/apis/json-rpc/)
- [EIP-1559 Analysis Arxiv](https://github.com/SciEcon/EIP1559)

## Layer 2 Solutions Launch Dates
Source: [L2BEAT](https://l2beat.com/scaling/tvl)
1. Optimism went live on January 16, 2021
2. Arbitrum went live on August 31, 2021

In [None]:
# !pip install web3
# !pip install pandas

In [1]:
import csv
import os
import glob
import time
from datetime import datetime  # the class, not the module: the code below calls datetime.strptime directly

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count

import pandas as pd
from collections import defaultdict

from web3 import Web3
import eth_abi

In [2]:
def setup_web3(endpoint):
    """
    Initialize a Web3 instance with the given endpoint.
    
    Args:
        endpoint (str): The HTTP provider endpoint to connect to.
    
    Returns:
        Web3: A Web3 instance connected to the provided endpoint.
    """
    return Web3(Web3.HTTPProvider(endpoint))

---
The function below uses binary search to find the block number for a given timestamp. It takes a Web3 instance, a target timestamp (in seconds), and optionally a start and end block that bound the search; if `end_block` is not specified, the search runs up to the latest block. Binary search applies because block timestamps increase with block number. When no block matches the target exactly, the function returns the block whose timestamp is closest to it.
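Before pointing the search at a live node, the same idea can be checked offline. The sketch below is a minimal stand-alone version, assuming timestamps never decrease with block number; `closest_block`, `fake_timestamps`, and the 13-second spacing are hypothetical names and values chosen for illustration, not part of the notebook.

```python
from typing import Callable


def closest_block(get_timestamp: Callable[[int], int], target: int, lo: int, hi: int) -> int:
    """Return the block in [lo, hi] whose timestamp is closest to target.

    Assumes timestamps are non-decreasing in block number.
    """
    first, last = lo, hi
    while lo <= hi:
        mid = (lo + hi) // 2
        ts = get_timestamp(mid)
        if ts == target:
            return mid
        if ts < target:
            lo = mid + 1
        else:
            hi = mid - 1
    # No exact hit: hi is the last block before the target and lo the first
    # block after it. Clamp both so they stay inside the searched range.
    before = max(hi, first)
    after = min(lo, last)
    if abs(get_timestamp(after) - target) < abs(get_timestamp(before) - target):
        return after
    return before


# Hypothetical chain: block n is mined at 1000 + 13 * n seconds.
fake_timestamps = [1000 + 13 * n for n in range(100)]
lookup = fake_timestamps.__getitem__

print(closest_block(lookup, 1000 + 13 * 40, 0, 99))      # exact hit on block 40
print(closest_block(lookup, 1000 + 13 * 40 + 5, 0, 99))  # between blocks 40 and 41
print(closest_block(lookup, 0, 0, 99))                   # before the first block
print(closest_block(lookup, 10**9, 0, 99))               # after the last block
```

The last two calls exercise targets outside the searched range, where an unclamped index would point at a block that does not exist.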

In [3]:
def get_block_number_by_timestamp(web3, target_timestamp, start_block=0, end_block=None):
    """
    Retrieve the block number for a specific timestamp using binary search.
    
    Args:
        web3 (Web3): A Web3 instance for interacting with the Ethereum blockchain.
        target_timestamp (int): The timestamp (in seconds) to find the block for.
        start_block (int, optional): The block number to start the search from. Defaults to 0.
        end_block (int, optional): The block number to end the search at. If None, the latest block number is used.
    
    Returns:
        int: The block number closest to the target timestamp.
    """
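    if end_block is None:
        end_block = web3.eth.block_number  # Use the latest block number if end_block is not specified

    first_block, last_block = start_block, end_block  # Remember the original search bounds

    while start_block <= end_block:
        mid_block = (start_block + end_block) // 2
        mid_block_timestamp = web3.eth.get_block(mid_block)['timestamp']

        if target_timestamp < mid_block_timestamp:
            end_block = mid_block - 1
        elif target_timestamp > mid_block_timestamp:
            start_block = mid_block + 1
        else:
            return mid_block

    # If the exact timestamp is not found, the target lies between end_block and start_block.
    # Clamp both to the original bounds so that a target outside the range
    # does not request a block that does not exist (e.g. block -1 or latest + 1).
    before = max(end_block, first_block)
    after = min(start_block, last_block)
    before_diff = abs(web3.eth.get_block(before)['timestamp'] - target_timestamp)
    after_diff = abs(web3.eth.get_block(after)['timestamp'] - target_timestamp)
    return after if after_diff < before_diff else before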

To use this function, you would first convert your dates to Unix timestamps (seconds since Jan 1, 1970). You can do this using Python's built-in datetime module:

In [4]:
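from datetime import timezone

def timestamp(date_string):
    """
    Convert a date string to a Unix timestamp.
    
    Args:
        date_string (str): The date string in 'YYYY-MM-DD' format, interpreted as midnight UTC.
    
    Returns:
        int: The Unix timestamp corresponding to the date string.
    """
    # Attach UTC explicitly: a naive datetime would be interpreted in the machine's
    # local time zone, while the daily CSV files further down are split by UTC date.
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())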
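
A round trip is a cheap way to check the conversion. The sketch below uses two hypothetical helpers, `utc_timestamp` and `utc_date`, that pin the time zone to UTC explicitly so the result does not depend on the machine's local time zone.

```python
from datetime import datetime, timezone


def utc_timestamp(date_string: str) -> int:
    """Convert 'YYYY-MM-DD' to the Unix timestamp of midnight UTC on that date."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def utc_date(ts: int) -> str:
    """Convert a Unix timestamp back to a 'YYYY-MM-DD' date string in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


london_day = utc_timestamp("2021-08-05")
print(london_day)                # 1628121600
print(utc_date(london_day))      # 2021-08-05
print(utc_date(london_day - 1))  # 2021-08-04: one second earlier is the previous day
```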


---
Detecting token transfers requires inspecting a transaction's `input` data (the `eth_abi` package can decode it in full) and a way to check whether a given address is a contract address. Note that classifying every Ethereum transaction type with complete accuracy is a complex task that requires deep analysis of the input data and possibly contract state, so the simplified approach below will not catch every edge case.
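
To make the layout of the `input` data concrete, here is a minimal standard-library sketch that decodes the calldata of an ERC-20 `transfer(address,uint256)` call by hand. The helper name `decode_erc20_transfer` and the sample calldata are hypothetical. It also assumes the calldata is exactly one selector plus two 32-byte words, which is stricter than what real nodes accept, so treat it as an illustration rather than a replacement for `eth_abi`.

```python
def decode_erc20_transfer(input_hex: str):
    """Decode `transfer(address,uint256)` calldata into (recipient, amount).

    Returns None when the calldata is not a well-formed ERC-20 transfer call.
    """
    data = bytes.fromhex(input_hex[2:] if input_hex.startswith("0x") else input_hex)
    # 4-byte selector followed by two 32-byte ABI words
    if len(data) != 4 + 32 + 32 or data[:4].hex() != "a9059cbb":
        return None
    recipient_word, amount_word = data[4:36], data[36:68]
    # An address occupies the low 20 bytes of its word; the high 12 bytes must be zero.
    if any(recipient_word[:12]):
        return None
    recipient = "0x" + recipient_word[12:].hex()
    amount = int.from_bytes(amount_word, "big")
    return recipient, amount


# Hypothetical calldata: send 1000 base units to the address 0x1111...1111.
calldata = "0xa9059cbb" + "00" * 12 + "11" * 20 + format(1000, "064x")
print(decode_erc20_transfer(calldata))
print(decode_erc20_transfer("0x"))  # plain Ether transfer: nothing to decode
```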

In [5]:
def is_contract(web3, address):
    """
    Determine if an address is a contract address.
    """
    return web3.eth.get_code(address) != b''

The process to identify if a transaction is a simple Ether transfer, Token transfer, or Smart contract interaction is not straightforward because of the versatile nature of Ethereum transactions. 

1. **Simple Ether Transfers:** These are transactions where Ether is transferred from one account to another. These transactions can be identified by checking if the `input` data field is '0x' (i.e., empty) and the `to` field is not a contract address (it doesn't have associated bytecode).

2. **Token Transfers (ERC-20, ERC-721, etc.):** These are transactions where tokens (like USDT, DAI, etc.) are transferred. Token transfers follow the interfaces defined in the ERC-20 and ERC-721 specifications. One of the main ERC-20 methods is `transfer(address,uint256)`. When this method is called, the `input` data field of the transaction starts with the method ID, which is the first 4 bytes of the keccak256 hash of the method signature; for `transfer(address,uint256)` this is '0xa9059cbb'. You can therefore identify ERC-20 token transfers by checking whether the `input` field starts with this method ID. This check can yield false positives, because any contract can expose a function with the same signature for a different purpose, and it misses transfers made through other methods such as `transferFrom(...)`. Identifying ERC-721 (NFT) transfers is more challenging, as multiple methods (`safeTransferFrom(...)`, `transferFrom(...)`, etc.) can be used to transfer tokens.

3. **Smart Contract Interactions:** If the `input` data field is not '0x' (empty), and the `input` does not correspond to a standard ERC-20 or ERC-721 token transfer, you can classify the transaction as a smart contract interaction.

Please note that these checks provide a rough classification of the transactions but aren't perfect. For example, they don't account for cases where a single transaction involves both Ether and token transfers or multiple types of contract interactions. Further, they don't cover all token standards (like ERC-1155, which allows for both fungible and non-fungible tokens).

As a result, if you need a precise classification of transaction types, you might need to employ more sophisticated methods, such as analyzing the bytecode of contracts, tracking the state changes of known contracts, or using specialized services or libraries that provide this kind of analysis.
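
The three rules above can be sketched as a pure function that needs no node connection. The names `KNOWN_TRANSFER_SELECTORS` and `classify` are hypothetical, the selector table is a small illustrative subset rather than a complete list, and whether `to` is a contract is passed in as a flag instead of being looked up on chain.

```python
from typing import Optional

# Well-known 4-byte method IDs of token transfer functions (illustrative subset).
KNOWN_TRANSFER_SELECTORS = {
    "0xa9059cbb": "ERC-20 transfer(address,uint256)",
    "0x23b872dd": "ERC-20/ERC-721 transferFrom(address,address,uint256)",
    "0x42842e0e": "ERC-721 safeTransferFrom(address,address,uint256)",
    "0xb88d4fde": "ERC-721 safeTransferFrom(address,address,uint256,bytes)",
    "0xf242432a": "ERC-1155 safeTransferFrom(address,address,uint256,uint256,bytes)",
}


def classify(to: Optional[str], input_hex: str, to_is_contract: bool) -> str:
    """Roughly classify a transaction from its `to` and `input` fields."""
    if to is None:
        return "Contract Creation"
    if input_hex in ("", "0x"):
        # Empty calldata sent to a contract still runs its receive/fallback logic.
        return "Interaction with a Contract" if to_is_contract else "Simple Ether Transfer"
    selector = input_hex[:10].lower()
    if selector in KNOWN_TRANSFER_SELECTORS:
        return "Token Transfer"
    return "Interaction with a Contract"


print(classify("0xabc", "0x", to_is_contract=False))
print(classify("0xabc", "0xa9059cbb" + "00" * 64, to_is_contract=True))
print(classify("0xabc", "0x12345678", to_is_contract=True))
print(classify(None, "0x6080", to_is_contract=False))
```

Matching on the selector alone inherits the false-positive caveat described above: it says which function was called, not whether the target contract is really a token.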

---

## Data Collection

For the data collection, we use Python's `concurrent.futures` library to parallelize the extraction of block data. The key idea is to split the block range into small chunks and process these chunks concurrently in a pool of worker processes.
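
The splitting step can be illustrated without a node. In the sketch below, `split_range` and `fetch_range` are hypothetical helpers, and `fetch_range` merely returns block numbers as a stand-in for the RPC work. It uses a `ThreadPoolExecutor` to stay self-contained, whereas `collect_data` further down uses a `ProcessPoolExecutor`; both expose the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start: int, end: int, size: int):
    """Split the inclusive range [start, end] into consecutive inclusive sub-ranges."""
    return [(lo, min(lo + size - 1, end)) for lo in range(start, end + 1, size)]


def fetch_range(bounds):
    """Stand-in for the per-block RPC work: return the block numbers in the range."""
    lo, hi = bounds
    return list(range(lo, hi + 1))


chunks = split_range(100, 125, 10)
print(chunks)  # [(100, 109), (110, 119), (120, 125)]

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map yields results in input order, even if workers finish out of order.
    blocks = [b for part in executor.map(fetch_range, chunks) for b in part]

print(blocks == list(range(100, 126)))  # True
```

Because `map` preserves input order, downstream code can rely on blocks arriving in ascending order, which is what makes the per-day file rotation possible.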

In [6]:
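def extract_block_data(web3, block):
    """
    Extracts transaction data from a block.
    """
    ERC20_TRANSFER_SIGNATURE = '0xa9059cbb'
    EIP_1559_ACTIVATION_BLOCK = 12965000 # Block at which EIP-1559 was activated
    
    block_data = web3.eth.get_block(block, full_transactions=True)
    data = []
    for tx in block_data['transactions']:
        # Normalize the input data to a '0x'-prefixed hex string
        # (depending on the web3.py version it arrives as a str or as bytes)
        tx_input = tx['input'] if isinstance(tx['input'], str) else '0x' + bytes(tx['input']).hex()

        # Define transaction type
        if tx['to'] is None:  # handle contract creation
            tx_type = 'Contract Creation'
        elif tx_input == '0x' and not is_contract(web3, tx['to']):
            tx_type = 'Simple Ether Transfer'
        elif tx_input[:10] == ERC20_TRANSFER_SIGNATURE:
            tx_type = 'ERC20 Transfer'
        else:
            tx_type = 'Interaction with a Contract'

        # Define EIP-1559 type (the 'type' field is only inspected from the activation block on)
        if block < EIP_1559_ACTIVATION_BLOCK or tx['type'] in [0, 1]:
            eip_1559_type = 'Legacy'
        elif tx['type'] == 2:
            eip_1559_type = 'EIP-1559'
        else:
            eip_1559_type = 'Unknown'

        # The execution status and the gas actually used are only in the receipt.
        # This costs one extra RPC call per transaction.
        receipt = web3.eth.get_transaction_receipt(tx['hash'])
        effective_gas_price = receipt.get('effectiveGasPrice', tx['gasPrice'])

        tx_data = {
            'Transaction Identifier': tx['hash'].hex(),
            'Block Number': block,
            'Transaction Timestamp': block_data['timestamp'],
            # A contract creation (to is None) is not a failure; use the receipt status instead
            'Transaction Status': 'Success' if receipt['status'] == 1 else 'Failed',
            'Gas Price': tx['gasPrice'],
            # Fee actually paid: gas used (not the gas limit in tx['gas']) times the price per gas
            'Transaction Fee': receipt['gasUsed'] * effective_gas_price,
            'Sender\'s Address': tx['from'],
            'Transaction Type': tx_type,
            'Transaction EIP-1559 Type': eip_1559_type
        }
        data.append(tx_data)
    return data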

In [7]:
def extract_block_data_wrapper(args):
    block = args[0]
    node_url = args[1]  # We're passing the URL here instead of the Web3 object

    # Re-initialize the Web3 object here
    web3 = Web3(Web3.HTTPProvider(node_url))

    return extract_block_data(web3, block)

In [8]:
def collect_data(node_url, start_block, end_block):
    """
    Collect transaction data from a range of blocks on the Ethereum blockchain.
    """
    csvfile = None
    writer = None
    last_date_str = None
    daily_transactions = []

    # Define the chunk size for the executor
    chunk_size = 10

    # Use the ProcessPoolExecutor for multiprocessing
    with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
        for block_data in executor.map(extract_block_data_wrapper, [(block, node_url) for block in range(start_block, end_block + 1)], chunksize=chunk_size):
            for tx_data in block_data:
                date_str = datetime.utcfromtimestamp(tx_data['Transaction Timestamp']).strftime('%Y-%m-%d')

                # If date has changed, write daily transactions to file and start a new day
                if date_str != last_date_str:
                    if csvfile is not None:
                        writer.writerows(daily_transactions)
                        csvfile.close()

                    os.makedirs("../data", exist_ok=True)  # Make sure the output directory exists
                    file_name = f"../data/eth_transaction_data_{date_str}.csv"
                    csvfile = open(file_name, 'a', newline='')
                    fieldnames = list(tx_data.keys())
                    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

                    # Write header only if the file is newly created
                    if csvfile.tell() == 0:
                        writer.writeheader()

                    daily_transactions = []
                    last_date_str = date_str

                daily_transactions.append(tx_data)

    # Write transactions for the last day and close the file
    if csvfile is not None:
        writer.writerows(daily_transactions)
        csvfile.close()

The code above identifies simple Ether transfers, token transfers, and smart contract interactions as described earlier. It uses `get_code` to check whether the `to` address is a contract, and compares the first 4 bytes of the `input` field (8 hex characters after the `0x` prefix) against the method ID of the ERC-20 `transfer` function to identify token transfers. If the `input` field is not empty and doesn't correspond to a token transfer, the transaction is classified as a smart contract interaction.

In this code, the `ProcessPoolExecutor` manages a pool of worker processes, one per CPU core. The `executor.map` function applies `extract_block_data_wrapper` to each block number in the specified range, handing blocks to the workers in chunks of `chunk_size`, and yields the results in block order. Each result is the list of transaction records for one block; the records are buffered per day and written to that day's CSV file when the date changes.

This approach should significantly speed up data collection by issuing requests from several processes at once. The actual speedup depends on the number of cores on your machine, as well as on network latency, the throughput of the Ethereum node, and the I/O performance of your machine.
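
The per-day file rotation in `collect_data` can be sketched in isolation. The example below groups time-ordered rows by UTC date and renders one CSV document per day into memory instead of onto disk; `utc_day`, `rows_to_daily_csv`, and the sample rows are hypothetical, and the two-column schema is a simplification of the notebook's record format.

```python
import csv
import io
from datetime import datetime, timezone
from itertools import groupby


def utc_day(row):
    """Return the UTC calendar date ('YYYY-MM-DD') of a row's timestamp."""
    return datetime.fromtimestamp(row["timestamp"], tz=timezone.utc).strftime("%Y-%m-%d")


def rows_to_daily_csv(rows):
    """Group time-ordered rows by UTC day and render one CSV document per day."""
    out = {}
    # groupby only merges consecutive rows, so the input must be sorted by time.
    for day, day_rows in groupby(rows, key=utc_day):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["hash", "timestamp"])
        writer.writeheader()
        writer.writerows(day_rows)
        out[day] = buffer.getvalue()
    return out


# Hypothetical rows: two just before and one exactly at midnight UTC on 2021-08-05.
rows = [
    {"hash": "0xaa", "timestamp": 1628121598},
    {"hash": "0xbb", "timestamp": 1628121599},
    {"hash": "0xcc", "timestamp": 1628121600},
]
daily = rows_to_daily_csv(rows)
print(sorted(daily))  # ['2021-08-04', '2021-08-05']
```

Grouping consecutive rows works here for the same reason it works in `collect_data`: results arrive in ascending block order, so a change of date marks the end of a day.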

---

## Steps to Collect Data

### Step 1: Initialize a Web3 instance with an Infura endpoint or a local Ethereum node.

In [9]:
# # Use Infura API endpoint
# infura_url = os.getenv("INFURA_MAINNET_URL")

# if not infura_url:
#     raise ValueError("INFURA_MAINNET_URL is not set in the environment variables")

# web3 = setup_web3(infura_url)

# Use Ethereum node on local machine
node_url = 'http://localhost:8545'
web3 = Web3(Web3.HTTPProvider(node_url))

print(web3.is_connected())  # prints True if the connection is successful

True


### Step 2: Convert the start and end dates to Unix timestamps and then to block numbers.

In [11]:
# Run only once to get the start block and end block for the specified time period
start_date = "2021-02-05"
end_date = "2022-02-05"

start_timestamp = timestamp(start_date)
end_timestamp = timestamp(end_date)

start_block = get_block_number_by_timestamp(web3, start_timestamp)
end_block = get_block_number_by_timestamp(web3, end_timestamp)

print(f"The block number at {start_date} is {start_block}")
print(f"The block number at {end_date} is {end_block}")

The block number at 2021-02-05 is 11794239
The block number at 2022-02-05 is 14143963
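
A quick plausibility check on these two numbers is to compute the average block time they imply. The sketch below only does arithmetic on the values printed above and assumes the range spans exactly 365 days (there is no leap day between the two dates).

```python
start_block, end_block = 11794239, 14143963  # values printed above
seconds_in_range = 365 * 24 * 60 * 60        # 2021-02-05 to 2022-02-05, no leap day

blocks = end_block - start_block
avg_block_time = seconds_in_range / blocks
print(f"{blocks} blocks, about {avg_block_time:.2f} s per block")
# 2349724 blocks, about 13.42 s per block
```

An average of roughly 13 seconds per block is in line with Ethereum's proof-of-work era, which suggests the binary search landed on sensible blocks.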


### Step 3: Collect the transaction data and write to CSV files, one file for each day

In [None]:
# Record start time
start_time = time.time()

# Call `collect_data` function
collect_data(node_url, start_block, end_block)

# Record the end time
end_time = time.time()

# Calculate the difference in seconds, then convert to minutes
execution_time_minutes = (end_time - start_time) / 60

print(f"The code took {execution_time_minutes} minutes to run.")