# Transaction Data Collection from the Ethereum Blockchain

## EIP-1559
- Date: August 5, 2021
- Block number: 12,965,000
- [Ethereum JSON-RPC Specification](https://ethereum.github.io/execution-apis/api-documentation/)
- [JSON RPC API](https://ethereum.org/en/developers/docs/apis/json-rpc/)
- [EIP-1559 Analysis Arxiv](https://github.com/SciEcon/EIP1559)

## Layer 2 Solutions Launch Dates
Source: [L2BEAT](https://l2beat.com/scaling/tvl)
1. Optimism is live on: January 16, 2021
2. Arbitrum is live on: August 31, 2021

In [1]:
# !pip install web3
# !pip install pandas

In [2]:
import csv
import os
from datetime import datetime
import concurrent.futures

import pandas as pd
from collections import defaultdict

from web3 import Web3
import eth_abi

In [3]:
def setup_web3(endpoint):
    """
    Initialize a Web3 instance with the given endpoint.
    
    Args:
        endpoint (str): The HTTP provider endpoint to connect to.
    
    Returns:
        Web3: A Web3 instance connected to the provided endpoint.
    """
    return Web3(Web3.HTTPProvider(endpoint))

---
Here is a Python function that uses binary search to find the block number for a given timestamp. This function accepts a target timestamp (in seconds), a Web3 instance for blockchain interaction, and optionally a start and end block to search between. If end_block is not specified, the search is performed up to the latest block. The function performs a binary search to find the block with the timestamp closest to the target.

In [4]:
def get_block_number_by_timestamp(web3, target_timestamp, start_block=0, end_block=None):
    """
    Retrieve the block number for a specific timestamp using binary search.
    
    Args:
        web3 (Web3): A Web3 instance for interacting with the Ethereum blockchain.
        target_timestamp (int): The timestamp (in seconds) to find the block for.
        start_block (int, optional): The block number to start the search from. Defaults to 0.
        end_block (int, optional): The block number to end the search at. If None, the latest block number is used.
    
    Returns:
        int: The block number closest to the target timestamp.
    """
    if end_block is None:
        end_block = web3.eth.block_number  # Use the latest block number if end_block is not specified

    low, high = start_block, end_block
    while low <= high:
        mid_block = (low + high) // 2
        mid_block_timestamp = web3.eth.get_block(mid_block)['timestamp']

        if target_timestamp < mid_block_timestamp:
            high = mid_block - 1
        elif target_timestamp > mid_block_timestamp:
            low = mid_block + 1
        else:
            return mid_block

    # No exact match: low and high now bracket the target (high == low - 1).
    # Clamp both to the search range so a nonexistent block is never requested.
    low = min(low, end_block)
    high = max(high, start_block)
    low_diff = abs(web3.eth.get_block(low)['timestamp'] - target_timestamp)
    high_diff = abs(web3.eth.get_block(high)['timestamp'] - target_timestamp)
    return low if low_diff < high_diff else high

To use this function, you would first convert your dates to Unix timestamps (seconds since Jan 1, 1970). You can do this using Python's built-in datetime module:

In [5]:
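def timestamp(date_string):
    """
    Convert a date string to a Unix timestamp.
    
    Args:
        date_string (str): The date string in 'YYYY-MM-DD' format.
    
    Returns:
        int: The Unix timestamp corresponding to midnight on that date
            in the machine's local time zone.
    """
    # Note: strptime returns a naive datetime, and .timestamp() interprets it in the
    # local time zone. For a machine-independent result, pin it to UTC first with
    # .replace(tzinfo=timezone.utc) (timezone comes from the datetime module).
    dt = datetime.strptime(date_string, "%Y-%m-%d")
    return int(dt.timestamp())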


---
To implement the token transfer detection, you would need the `eth_abi` package to decode contract input data and a way to check if a given address is a contract address. However, please note that correctly detecting and classifying all Ethereum transaction types with 100% accuracy is a complex task and requires deep analysis of the input data and possibly contract state. Therefore, the following simplified approach may not catch every edge case.

In [12]:
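def is_contract(web3, address):
    """
    Determine if an address is a contract address.

    Returns False for None, which is the `to` value of a contract-creation transaction.
    """
    if address is None:
        return False
    # An externally owned account has no bytecode, so get_code returns empty bytes
    return len(web3.eth.get_code(address)) > 0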

The process to identify if a transaction is a simple Ether transfer, Token transfer, or Smart contract interaction is not straightforward because of the versatile nature of Ethereum transactions. 

1. **Simple Ether Transfers:** These are transactions where Ether is transferred from one account to another. These transactions can be identified by checking if the `input` data field is '0x' (i.e., empty) and the `to` field is not a contract address (it doesn't have associated bytecode).

2. **Token Transfers (ERC-20, ERC-721, etc.):** These are transactions where tokens (like USDT, DAI, etc.) are transferred. Token transfers follow a standard set of rules defined in the ERC-20 or ERC-721 specifications. One of the main methods in the ERC-20 specification is `transfer(address,uint256)`. When this method is called, the `input` data field in the transaction starts with the method ID, which is the first 4 bytes of the keccak256 hash of the method signature. For `transfer(address,uint256)` the method ID is '0xa9059cbb', so you can identify ERC-20 token transfers by checking whether the `input` field starts with this method ID. However, this check can yield false positives, as other contracts can use the same method ID for different purposes. Identifying ERC-721 (NFT) transfers is more challenging, as multiple methods (`safeTransferFrom(...)`, `transferFrom(...)`, etc.) can be used to transfer tokens, and ERC-721 does not define `transfer(...)` at all.

3. **Smart Contract Interactions:** If the `input` data field is not '0x' (empty), and the `input` does not correspond to a standard ERC-20 or ERC-721 token transfer, you can classify the transaction as a smart contract interaction.

Please note that these checks provide a rough classification of the transactions but aren't perfect. For example, they don't account for cases where a single transaction involves both Ether and token transfers or multiple types of contract interactions. Further, they don't cover all token standards (like ERC-1155, which allows for both fungible and non-fungible tokens).

As a result, if you need a precise classification of transaction types, you might need to employ more sophisticated methods, such as analyzing the bytecode of contracts, tracking the state changes of known contracts, or using specialized services or libraries that provide this kind of analysis.
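
The three rules above can be sketched offline, without a node, as a small classifier over the raw `input` hex string. This is a minimal illustration rather than a complete detector: `classify_transaction` and `decode_erc20_transfer` are hypothetical helper names, the contract check is passed in as a boolean, and the selector table lists only a few well-known transfer methods (note that `transferFrom` is shared by ERC-20 and ERC-721).

```python
# Well-known 4-byte selectors (first 4 bytes of keccak256 of the method signature)
TRANSFER_SELECTORS = {
    "0xa9059cbb": "transfer(address,uint256)",
    "0x23b872dd": "transferFrom(address,address,uint256)",
    "0x42842e0e": "safeTransferFrom(address,address,uint256)",
    "0xb88d4fde": "safeTransferFrom(address,address,uint256,bytes)",
}


def classify_transaction(input_hex, to_is_contract):
    """Return a rough category for a transaction (hypothetical helper)."""
    input_hex = input_hex.lower()
    if input_hex in ("0x", "") and not to_is_contract:
        return "Simple Ether Transfer"
    if input_hex[:10] in TRANSFER_SELECTORS:
        return "Token Transfer"
    return "Smart Contract Interaction"


def decode_erc20_transfer(input_hex):
    """Decode recipient and amount from ERC-20 transfer(address,uint256) calldata."""
    body = input_hex[10:]  # strip '0x' and the 4-byte selector
    if input_hex[:10].lower() != "0xa9059cbb" or len(body) < 128:
        raise ValueError("not a well-formed ERC-20 transfer call")
    recipient = "0x" + body[24:64]  # address: last 20 bytes of the first 32-byte word
    amount = int(body[64:128], 16)  # uint256: second 32-byte word
    return recipient, amount


# Synthetic calldata: transfer(0x1111...1111, 1000)
calldata = "0xa9059cbb" + "00" * 12 + "11" * 20 + format(1000, "064x")
print(classify_transaction(calldata, True))
print(decode_erc20_transfer(calldata))
```

The decoded amount is in the token's smallest unit; converting it to a human-readable value requires the token's `decimals`, which is not part of the calldata.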

---

## Data Collection

For the data collection, we use Python's `concurrent.futures` library to parallelize the extraction of block data. The key here is to split the block range into smaller ranges, and handle each of these ranges concurrently using a separate thread.

In [13]:
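from concurrent.futures import ThreadPoolExecutor

# First 4 bytes of keccak256("transfer(address,uint256)")
ERC20_TRANSFER_SIGNATURE = '0xa9059cbb'

def extract_block_data(web3, block):
    """
    Extracts transaction data from a block.
    """
    block_data = web3.eth.get_block(block, full_transactions=True)
    data = []
    for tx in block_data['transactions']:
        # Normalize the input to a '0x'-prefixed hex string; depending on the
        # web3.py version it is returned as bytes (HexBytes) or as a string
        raw_input = tx['input']
        input_hex = raw_input if isinstance(raw_input, str) else '0x' + bytes(raw_input).hex()

        # Define transaction type (contract creations have no `to` address
        # and fall through to 'Smart Contract Interaction')
        if input_hex == '0x' and tx['to'] is not None and not is_contract(web3, tx['to']):
            tx_type = 'Simple Ether Transfer'
        elif input_hex[:10] == ERC20_TRANSFER_SIGNATURE:
            tx_type = 'Token Transfer'
        else:
            tx_type = 'Smart Contract Interaction'

        # Define EIP-1559 type; the type field may be an int or a hex string
        raw_type = tx.get('type', 0)
        type_id = int(raw_type, 16) if isinstance(raw_type, str) else int(raw_type)
        eip_1559_type = 'Legacy' if type_id in (0, 1) else 'EIP-1559' if type_id == 2 else 'Unknown'

        # Status and gas actually used are only available in the receipt.
        # This costs one extra RPC request per transaction.
        receipt = web3.eth.get_transaction_receipt(tx['hash'])
        price_paid = receipt.get('effectiveGasPrice', tx['gasPrice'])

        tx_data = {
            'Transaction Identifier': tx['hash'].hex(),
            'Block Number': block,
            'Transaction Timestamp': block_data['timestamp'],
            'Transaction Status': 'Success' if receipt['status'] == 1 else 'Failed',
            'Gas Price': tx['gasPrice'],
            'Transaction Fee': receipt['gasUsed'] * price_paid,
            'Sender\'s Address': tx['from'],
            'Transaction Type': tx_type,
            'Transaction EIP-1559 Type': eip_1559_type
        }
        data.append(tx_data)
    return data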

def collect_data(web3, start_block, end_block):
    """
    Collect transaction data from a range of blocks on the Ethereum blockchain.
    """
    data = []

    with ThreadPoolExecutor() as executor:
        for block_data in executor.map(lambda block: extract_block_data(web3, block), range(start_block, end_block + 1)):
            data.extend(block_data)

    return data


The script above identifies simple Ether transfers, token transfers, and smart contract interactions according to the descriptions. It uses `web3.eth.get_code` (the `eth_getCode` RPC method) to check whether the `to` address is a contract, and compares the 4-byte method ID at the start of the `input` field (10 characters including the `0x` prefix) against the method ID of the ERC-20 `transfer` function to identify token transfers. If the `input` field is not empty and doesn't correspond to a token transfer, the transaction is classified as a smart contract interaction.
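
For type-2 (EIP-1559) transactions, the price paid per unit of gas follows from the block's base fee and the sender's two fee caps. The sketch below works through that arithmetic on plain integers in wei; `eip1559_fee` and its parameter names are hypothetical, and the inputs would come from the block header (`baseFeePerGas`), the transaction (`maxFeePerGas`, `maxPriorityFeePerGas`), and the receipt (`gasUsed`).

```python
def eip1559_fee(gas_used, base_fee_per_gas, max_fee_per_gas, max_priority_fee_per_gas):
    """Return (total_fee, burned, tip) in wei for a type-2 transaction."""
    if max_fee_per_gas < base_fee_per_gas:
        raise ValueError("transaction cannot be included: max fee is below the base fee")
    # The tip is capped so that base fee + tip never exceeds the max fee
    priority_fee_per_gas = min(max_priority_fee_per_gas, max_fee_per_gas - base_fee_per_gas)
    effective_gas_price = base_fee_per_gas + priority_fee_per_gas
    total_fee = gas_used * effective_gas_price
    burned = gas_used * base_fee_per_gas      # the base fee portion is burned
    tip = gas_used * priority_fee_per_gas     # the priority fee goes to the block producer
    return total_fee, burned, tip


GWEI = 10**9
# A plain transfer (21,000 gas) with a 30 gwei base fee, 50 gwei max fee, 2 gwei tip cap
print(eip1559_fee(21_000, 30 * GWEI, 50 * GWEI, 2 * GWEI))
```

Multiplying the gas limit by the gas price instead gives only an upper bound on the fee, since a transaction rarely consumes its full gas limit.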

In this code, the `ThreadPoolExecutor` manages a pool of worker threads. The `executor.map` function applies the `extract_block_data` function to each block number in the specified range and yields the results in block order. Each result is a list of transaction records for one block, which is then added to the `data` list with `extend`.

This approach can speed up data collection considerably because the work is dominated by network I/O: while one thread waits for an RPC response, other threads can issue their own requests. The actual speedup depends mainly on network latency and on the request rate the provider allows rather than on the number of CPU cores, and too many concurrent requests can trigger rate limiting.
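
The idea of splitting the block range into smaller ranges can be sketched as follows. `chunk_block_range` is a hypothetical helper, not part of the notebook's pipeline; it only computes the inclusive sub-ranges, which could then be passed one at a time to `collect_data`.

```python
def chunk_block_range(start_block, end_block, chunk_size):
    """Split the inclusive range [start_block, end_block] into inclusive sub-ranges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for chunk_start in range(start_block, end_block + 1, chunk_size):
        chunk_end = min(chunk_start + chunk_size - 1, end_block)
        chunks.append((chunk_start, chunk_end))
    return chunks


print(chunk_block_range(100, 109, 4))  # [(100, 103), (104, 107), (108, 109)]
```

Collecting and saving one chunk at a time means that a failure partway through a long range loses only the chunk in progress.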

---

## Steps to Collect Data

### Step 1: Initialize a Web3 instance with my Infura endpoint.

In [14]:
infura_url = os.getenv("INFURA_MAINNET_URL")

if not infura_url:
    raise ValueError("INFURA_MAINNET_URL is not set in the environment variables")

web3 = setup_web3(infura_url)
print(web3.is_connected())  # should return True if the connection is successful

True


### Step 2: Convert my start and end dates to Unix timestamps and then to block numbers.

In [15]:
# Run only once to get the start block and end block for the specified time period
start_date = "2021-08-04"
end_date = "2021-08-05"

start_timestamp = timestamp(start_date)
end_timestamp = timestamp(end_date)

start_block = get_block_number_by_timestamp(web3, start_timestamp)
end_block = get_block_number_by_timestamp(web3, end_timestamp)

print(f"The block number at {start_date} is {start_block}")
print(f"The block number at {end_date} is {end_block}")

The block number at 2021-08-04 is 12956450
The block number at 2021-08-05 is 12962754


### Step 3: Collect the transaction data

In [16]:
data = collect_data(web3, start_block, end_block)

HTTPError: 429 Client Error: Too Many Requests for url: https://mainnet.infura.io/v3/<project-id>
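
A 429 response means the provider's rate limit was exceeded. One common mitigation is to retry failed calls with exponential backoff, usually combined with a smaller thread pool (for example `ThreadPoolExecutor(max_workers=4)`). The sketch below is a generic illustration: `call_with_retries` is a hypothetical helper, it retries on any exception for brevity (in practice you would narrow this to the HTTP error type your provider raises), and the delays are assumptions rather than Infura-recommended values.

```python
import random
import time


def call_with_retries(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Wait base_delay * 2^(attempt - 1) seconds plus a little random jitter
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Inside `collect_data`, the per-block call could then be wrapped as `call_with_retries(extract_block_data, web3, block)`.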

### Step 4: Write the data to a CSV file

In [None]:
# Specify the CSV file name
file_name = "eth_transaction_data.csv"

# Write data to the CSV file
with open(file_name, 'w', newline='') as csvfile:
    fieldnames = list(data[0].keys()) if data else []
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for row in data:
        writer.writerow(row)

In [None]:
def get_unique_senders(data):
    """
    Collect all unique sender addresses from a dataframe of transaction data.
    
    Args:
        data (DataFrame): A dataframe of transaction data whose column names
            have already been normalized (it must contain 'senders_address').
    
    Returns:
        DataFrame: A dataframe of unique sender addresses.
    """
    unique_senders = data['senders_address'].unique()
    return pd.DataFrame(unique_senders, columns=['senders_address'])

In [None]:
# Read the CSV file into a DataFrame
df = pd.read_csv('eth_transaction_data.csv')

In [None]:
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace("'", "")
df.columns

In [None]:
# Get unique sender addresses
unique_senders = get_unique_senders(df)

In [None]:
# If you want to save the unique senders to a CSV file
unique_senders.to_csv('unique_senders.csv', index=False)

In [None]:
# Convert Unix timestamp to pandas Timestamp
df['transaction_timestamp'] = pd.to_datetime(df['transaction_timestamp'], unit='s')

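# Sort by user address and timestamp, then reset the index so that the
# positional loop below (df.iloc[i] / df.at[i, ...]) refers to the same rows
df.sort_values(['senders_address', 'transaction_timestamp'], inplace=True)
df.reset_index(drop=True, inplace=True)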

# Add a column for transaction count within the last 12 hours, initialized with 0
df['trans_freq'] = 0

# Initialize the dictionary to store transaction counts
transaction_counts = defaultdict(list)

# Loop over each row in the DataFrame
for i in range(len(df)):
    # Get the current user and transaction timestamp
    current_user = df.iloc[i]['senders_address']
    current_time = df.iloc[i]['transaction_timestamp']

    # Define the 12-hour window start time
    window_start = current_time - pd.Timedelta(hours=12)

    # Store transaction timestamps for each user
    transaction_counts[current_user].append(current_time)

    # Keep only transactions within the 12-hour window
    transaction_counts[current_user] = [ts for ts in transaction_counts[current_user] if ts >= window_start]

    # Set the transaction frequency for the current row to the number of transactions within the time window
    df.at[i, 'trans_freq'] = len(transaction_counts[current_user])
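
The loop above rebuilds each sender's list of timestamps on every row. The same sliding-window count can be sketched in pure Python with a `deque` per sender, so that each timestamp is appended and removed at most once. `rolling_transaction_counts` is a hypothetical helper; it assumes the events are sorted by timestamp and that timestamps are Unix seconds.

```python
from collections import defaultdict, deque


def rolling_transaction_counts(events, window_seconds=12 * 3600):
    """
    events: iterable of (sender, unix_timestamp) pairs sorted by timestamp.
    Returns, for each event, the number of transactions by the same sender
    within the preceding window, including the current transaction.
    """
    recent = defaultdict(deque)
    counts = []
    for sender, ts in events:
        window = recent[sender]
        window.append(ts)
        # Drop timestamps that fall before the start of the window
        while window[0] < ts - window_seconds:
            window.popleft()
        counts.append(len(window))
    return counts


events = [("a", 0), ("b", 10), ("a", 3600), ("a", 50_000), ("b", 50_010)]
print(rolling_transaction_counts(events))  # [1, 1, 2, 1, 1]
```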


In [None]:
df[:10]

In [None]:
# Convert string to datetime and create a temporary column
df['temp_timestamp'] = pd.to_datetime(df['transaction_timestamp'])

# Set the launch dates for Layer 2 solutions
optimism_launch = datetime(2021, 1, 16)
arbitrum_launch = datetime(2021, 8, 31)

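# Add the 'layer2_availability' column: 1 once at least one Layer 2 solution is live
# (equivalent to comparing against the earliest launch date, Optimism's)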
df['layer2_availability'] = ((df['temp_timestamp'] >= arbitrum_launch) | (df['temp_timestamp'] >= optimism_launch)).astype(int)

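# Add the 'post_eip1559' column.
# EIP-1559 went live with block 12,965,000 on August 5, 2021, so compare block
# numbers rather than calendar dates; a midnight cutoff would mislabel the
# transactions from earlier that day
eip1559_launch_block = 12_965_000
df['post_eip1559'] = (df['block_number'] >= eip1559_launch_block).astype(int)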

# Drop the temporary column
df = df.drop('temp_timestamp', axis=1)

In [None]:
df[:10]

---
1. **User transaction count**: The cumulative number of transactions made by the same user up to the current transaction.
2. **Failed transactions**: The cumulative number of failed transactions by the same user up to the current transaction.

In [None]:
# Create a new column that is 1 if the transaction failed and 0 otherwise
df['is_failed'] = (df['transaction_status'] == 'Failed').astype(int)

# Group by user address and calculate the cumulative count of transactions and failed transactions
df = df.sort_values('transaction_timestamp')
df['user_transaction_count'] = df.groupby('senders_address').cumcount() + 1
df['failed_transactions'] = df.groupby('senders_address')['is_failed'].cumsum()

# You can drop the 'is_failed' column if you no longer need it
df = df.drop('is_failed', axis=1)
df[:10]

In this script, we first created a new column `is_failed` that is 1 if the transaction failed and 0 otherwise. We then sorted the dataframe by the transaction timestamp to ensure transactions are processed in the order they occurred. 

Next, we grouped the dataframe by the sender's address and used `cumcount` to get the cumulative count of transactions for each user (we added 1 because `cumcount` starts from 0). We also used `cumsum` on the `is_failed` column to get the cumulative count of failed transactions for each user. 

Finally, we dropped the `is_failed` column as it's no longer needed.

---

In [None]:
df.columns

---
**Derive User Address Age**

We use the `web3.eth.get_transaction_count` function from web3.py, which returns the number of transactions sent from an address as of a given block. Because this count never decreases, the first block at which it is nonzero is the block in which the address was first used. Note that querying counts at historical blocks generally requires a node that retains historical state (an archive node).

In the script below, for each unique sender address, the script scans blocks from the genesis block up to the block of the address's earliest transaction in the dataframe. The first block at which the address has a nonzero transaction count is its first usage block, which is saved to the `first_usage` dictionary. After finding the first usage block of each address, the script adds a new column `address_age` to the data, derived from the first usage block of each sender.

This approach is likely to be slow if you have a large number of unique addresses, because for each address it scans block by block from the genesis block and issues one RPC request per block. With many addresses, the total number of blocks to scan can be enormous.

To optimize the calculation of user address age and utilize a high-performance computing cluster, we could use parallel computing techniques. This involves dividing the computation task into smaller jobs that can be run simultaneously across multiple processors or nodes in the cluster.

In Python, several libraries allow you to use parallel computing, such as `multiprocessing`, `concurrent.futures`, and `joblib`. Here's an example of how you could use the `concurrent.futures` library to parallelize the block scanning task. This script uses a `ThreadPoolExecutor` to run multiple block scanning tasks simultaneously. Because the tasks spend most of their time waiting on RPC responses, threads can overlap that waiting even on a single core. The number of concurrent tasks is set by the executor's `max_workers` argument, and the provider's rate limit may be the binding constraint.

In [None]:
# Function to find the first usage block of an address
def find_first_usage(address, transaction_block):
    # Scan blocks from the start block up to and including the transaction block,
    # since the address may have been used for the first time in that block
    for block in range(start_block, int(transaction_block) + 1):
        # If the address has sent any transactions by the end of this block
        if web3.eth.get_transaction_count(address, block):
            return block
    return None

In [None]:
# Sort by block number
df.sort_values(['block_number'], inplace=True)

# Prepare a dictionary to store the first usage block of each address
first_usage = {}

# Define the range of blocks to scan
start_block = 0  # This should be set to the genesis block

# Create a ThreadPoolExecutor
with concurrent.futures.ThreadPoolExecutor() as executor:
    # For each unique address in the data
    for address in df['senders_address'].unique():
        transaction_block = df[df['senders_address'] == address]['block_number'].min()
        
        # If the first usage block of the address has already been found, skip this address
        if address in first_usage:
            continue
        
        # Submit a new task to the executor
        future = executor.submit(find_first_usage, address, transaction_block)
        
        # Store the Future object in the dictionary
        first_usage[address] = future

# Retrieve the results from the Future objects
for address, future in first_usage.items():
    first_usage[address] = future.result()

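# Add a new column 'address_age' to the data: the number of blocks between the
# sender's first usage block and the block of each transaction
df['address_age'] = df['block_number'] - df['senders_address'].map(first_usage)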

df[:10]
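
Because an address's transaction count never decreases from one block to the next, the first usage block can also be located by binary search instead of a block-by-block scan. The sketch below takes the count lookup as a callable so it can be tried offline; `find_first_usage_block` is a hypothetical name, and with a live node the callable could be `lambda b: web3.eth.get_transaction_count(address, b)`.

```python
def find_first_usage_block(get_count, low, high):
    """
    Return the smallest block in [low, high] for which get_count(block) > 0,
    or None if the count is still zero at `high`.
    Assumes get_count is non-decreasing in the block number.
    """
    if get_count(high) == 0:
        return None
    while low < high:
        mid = (low + high) // 2
        if get_count(mid) > 0:
            high = mid       # first usage is at mid or earlier
        else:
            low = mid + 1    # first usage is after mid
    return low


# Offline stand-in for the RPC call: an address first used in block 1,234,567
fake_count = lambda block: 0 if block < 1_234_567 else 3
print(find_first_usage_block(fake_count, 0, 12_965_000))  # 1234567
```

For a search range of about 13 million blocks this needs roughly 24 lookups per address.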

---

**Time of the day variable**

Due to the global nature of Ethereum transactions, the "time of day" variable could have different implications for users in different time zones, and this could indeed introduce complexity and potentially confounding effects into the analysis.

However, there may still be value in including a "time of day" variable, even in a global context, for several reasons:

1. **Network Effects**: Blockchain networks like Ethereum may experience periods of higher and lower congestion, which could align with certain times of day, despite global usage. For example, if a substantial proportion of Ethereum users are based in a particular region (say, North America or East Asia), then the network might be busier during the waking hours of that region.

2. **Market Activity**: Cryptocurrency markets operate 24/7 and market activity (trading volume, price volatility, etc.) can vary significantly across different times of the day, potentially influencing user behavior. For example, users might be more likely to engage in DeFi transactions during periods of high market activity.

3. **Behavioral Patterns**: Regardless of the global nature of Ethereum, there might be common daily behavioral patterns. For example, users might be more active during their daytime and less active during their nighttime, and these patterns could aggregate up to observable patterns in the data.

However, given the global nature, it may be advisable to construct the "time of day" variable in a way that captures potential global effects. For example, you could split the day into fewer, larger chunks (like morning, afternoon, evening, and night), or analyze this variable carefully in your exploratory analysis to understand its distribution and potential impacts.

Another approach might be to construct a variable that captures the "local time of day" for each transaction, assuming you can infer or have information about the geographic location of each user (which could introduce privacy issues and may not be feasible). But again, these are more complex and may not necessarily offer a better representation.

Finally, including "time of day" in your initial model does not obligate you to keep it in your final model. If exploratory analysis or preliminary model results suggest it's not meaningful or is causing issues, you can always exclude it in later iterations of your modeling process.

In conclusion, the "time of day" could potentially be a meaningful variable in your analysis, but its use and interpretation require careful consideration due to the global nature of Ethereum usage.

Given the global nature of Ethereum, a simple division of a day into distinct periods based on a specific time zone may not be the most representative. However, a potential solution could be to divide the day into a number of periods that are likely to capture significant changes in activity. For instance:

1. **Daytime**: 06:00 to 17:59
2. **Evening**: 18:00 to 21:59
3. **Night**: 22:00 to 05:59

This division attempts to capture typical working hours (daytime), after-work hours (evening), and sleeping hours (night). Of course, given the global nature of the network, these periods won't align perfectly with these times for all users, but they might serve as a useful approximation.

If it's possible to incorporate additional information, such as the geographic distribution of Ethereum users or the times of day that tend to see the most network activity, this could be used to refine these periods further.

You can then create a new variable, "TimeOfDay", in your dataset by mapping each transaction timestamp to one of these periods.

Here's a Python script to do that, assuming the `transaction_timestamp` column has already been converted to pandas datetimes (not strings) and is in UTC:

In [None]:
def assign_time_of_day(timestamp):
    hour = timestamp.hour
    if 6 <= hour < 18:
        return 'Daytime'
    elif 18 <= hour < 22:
        return 'Evening'
    else:
        return 'Night'

df['time_of_day'] = df['transaction_timestamp'].apply(assign_time_of_day)