# Transaction Data Collection from the Ethereum Blockchain

## EIP-1559
- Date: August 5, 2021
- Block number: 12,965,000
- [Ethereum JSON-RPC Specification](https://ethereum.github.io/execution-apis/api-documentation/)
- [JSON RPC API](https://ethereum.org/en/developers/docs/apis/json-rpc/)
- [EIP-1559 Analysis Arxiv](https://github.com/SciEcon/EIP1559)

## Layer 2 Solutions Launch Dates
Source: [L2BEAT](https://l2beat.com/scaling/tvl)
1. Optimism went live on January 16, 2021
2. Arbitrum went live on August 31, 2021

In [1]:
import csv
import os
import re
import glob
import time
from datetime import datetime

import pandas as pd
import dask.dataframe as dd
from collections import defaultdict

In [None]:
def timestamp(date_string):
    """
    Convert a date string to a Unix timestamp.
    
    Args:
        date_string (str): The date string in 'YYYY-MM-DD' format.
    
    Returns:
        int: The Unix timestamp corresponding to the date string.
    """
    dt = datetime.strptime(date_string, "%Y-%m-%d")
    return int(dt.timestamp())
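
Note that `datetime.strptime` returns a naive datetime, so `timestamp` interprets midnight in the local timezone of the machine running the notebook. Unix timestamps, including Ethereum block timestamps, count seconds from the epoch in UTC, so a UTC-pinned variant may be preferable when matching dates to blocks. A minimal sketch follows; the name `utc_timestamp` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def utc_timestamp(date_string):
    """Convert a 'YYYY-MM-DD' date string to a Unix timestamp at midnight UTC."""
    dt = datetime.strptime(date_string, "%Y-%m-%d")
    # Attach UTC explicitly so the result does not depend on the local timezone
    return int(dt.replace(tzinfo=timezone.utc).timestamp())


# EIP-1559 activation date from the section above
eip1559_ts = utc_timestamp("2021-08-05")
print(eip1559_ts)  # 1628121600
```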


## 1. Merge Ethereum Individual Transaction Data

We had two sources of individual transaction data:
1. A full Ethereum node that I run in my office
2. Google BigQuery

I first collected the data using my Ethereum node. It took about 12 hours to collect the six months before and the six months after EIP-1559. With Google BigQuery, the same extraction is much faster: the SQL query itself finishes in a matter of minutes. Downloading the results still takes time, but overall Google BigQuery is the faster way to collect the data.
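
As a rough illustration of the BigQuery route, the sketch below only builds the SQL text for a window of six months on either side of EIP-1559. The table name `bigquery-public-data.crypto_ethereum.transactions`, the selected columns, and the helper names `shift_months` and `build_window_query` are assumptions for illustration; the query actually used for this dataset is not shown in this notebook.

```python
from datetime import date

# Assumed table: the public Ethereum dataset on BigQuery (not confirmed by the notebook)
TABLE = "bigquery-public-data.crypto_ethereum.transactions"


def shift_months(d, months):
    """Shift a date by whole months, keeping the day of month.

    Assumes the day exists in the target month (true for the 5th).
    """
    month_index = d.year * 12 + (d.month - 1) + months
    return d.replace(year=month_index // 12, month=month_index % 12 + 1)


def build_window_query(center, months=6, table=TABLE):
    """Return SQL text selecting transactions within +/- `months` of `center`."""
    start = shift_months(center, -months)
    end = shift_months(center, months)
    return (
        "SELECT `hash`, block_number, block_timestamp, from_address, "
        "receipt_contract_address\n"
        f"FROM `{table}`\n"
        f"WHERE DATE(block_timestamp) BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


# Window centered on the EIP-1559 date given at the top of the notebook
query = build_window_query(date(2021, 8, 5))
print(query)
```

The resulting string could be pasted into the BigQuery console or passed to a client library; exporting the result to CSV files is a separate step.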

### 1.1 Merge data extracted from Google BigQuery

In [4]:
# The directory containing your csv files
data_dir = "../data/eth_transaction_data/tx-data/"

# Filename pattern
filename_pattern = "*.csv"

# Find all CSV files in the directory (sorted so the merge order is reproducible)
all_files = sorted(glob.glob(os.path.join(data_dir, filename_pattern)))

# Flag to indicate whether it's the first file
first_file = True

# Note: append mode means re-running this cell adds the data again;
# delete the output file first if you need a fresh merge.
with open('../data/eth_transaction_data.csv', 'a') as singleFile:
    for full_file_path in all_files:
        # Read contract addresses as strings so the column dtype is consistent across files
        df = pd.read_csv(full_file_path, dtype={'receipt_contract_address': str})
        # Write the header only once, with the first file
        df.to_csv(singleFile, header=first_file)
        first_file = False


### 1.2 Merge data extracted from Ethereum full node

In [None]:
# Directory where your files are stored
data_dir = "../data/"

# Filename pattern
filename_pattern = "eth_transaction_data_{}.csv"

# Find all filenames in the directory
all_files = os.listdir(data_dir)

# Extract dates from filenames and convert to datetime
matches = [re.search(r'\d{4}-\d{2}-\d{2}', file) for file in all_files]
dates = [datetime.strptime(m.group(), "%Y-%m-%d") for m in matches if m]

# Find start and end dates
start_date = min(dates)
end_date = max(dates)

print(start_date, end_date)

In the script above:

1. The `os.listdir` function is used to retrieve all the filenames in the directory.
2. The `re.search` function is used to extract the date strings from the filenames using a regular expression that matches the date format (yyyy-mm-dd).
3. The `datetime.strptime` function is used to convert the date strings to datetime objects.
4. The `min` and `max` functions are used to find the start and end dates.
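
To see the date extraction in isolation, here is a small sketch run on made-up filenames. The names in `example_files` are hypothetical and only mimic the `eth_transaction_data_{}.csv` pattern.

```python
import re
from datetime import datetime

# Hypothetical directory listing; the last entry has no date and is skipped
example_files = [
    "eth_transaction_data_2021-02-05.csv",
    "eth_transaction_data_2021-08-05.csv",
    "eth_transaction_data_2022-02-04.csv",
    "notes.txt",
]

# Same regular expression and date format as in the cell above
matches = [re.search(r"\d{4}-\d{2}-\d{2}", name) for name in example_files]
example_dates = [datetime.strptime(m.group(), "%Y-%m-%d") for m in matches if m]

print(min(example_dates).date(), max(example_dates).date())
```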

In [None]:
# Flag to indicate whether it's the first file
first_file = True

# Create or open the final CSV file in append mode
with open('../data/merged_eth_transaction_data.csv', 'a') as singleFile:
    for single_date in pd.date_range(start_date, end_date):
        filename = os.path.join(data_dir, filename_pattern.format(single_date.strftime("%Y-%m-%d")))
        
        if os.path.isfile(filename):  # if the file exists
            df = pd.read_csv(filename, dtype={5: float})
            # Write data to file
            if first_file:  # If it's the first file
                df.to_csv(singleFile, header=True)  # Write with header
                first_file = False  # After the first file, set this flag to False
            else:
                df.to_csv(singleFile, header=False, mode='a')  # If not the first file, write without header

In this script for merging transaction data:
1. The script iterates over the date range, reads the data for each date into a DataFrame, and appends it to the final CSV file.
2. A flag, `first_file`, tracks whether the current file is the first one. If it is, the script writes the DataFrame to the CSV file with headers; for subsequent files, the DataFrame is written without headers. Because the output file is opened in append mode (`'a'`), each write is added to the end of the file.
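
Since the output file is opened in append mode, running the merge cell twice would write a second header row (and a second copy of the data) into the same file. One way to check for that is to count how many lines equal the header line. The helper below is a sketch, and `count_header_rows` is a hypothetical name; it assumes no data row is textually identical to the header.

```python
import os
import tempfile


def count_header_rows(path):
    """Count the lines in a CSV file that are identical to its first line."""
    with open(path, "r", newline="") as f:
        header = f.readline()
        if not header:
            return 0  # Empty file
        # Start at 1 for the header itself, then stream through the rest
        return 1 + sum(1 for line in f if line == header)


# Small demonstration on a temporary file that mimics a merge run twice
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "merged.csv")
    with open(demo_path, "w", newline="") as f:
        f.write(",hash,value\n0,0xabc,1.0\n")
        f.write(",hash,value\n0,0xabc,1.0\n")  # Second run appended again
    demo_count = count_header_rows(demo_path)

print(demo_count)  # 2
```

A cleanly merged file should give a count of 1.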

## Extract Unique Sender Addresses from Ethereum Transaction Data
The following Python function can be used to extract unique sender addresses from Ethereum transaction data.

In [None]:
def extract_unique_senders(input_file, output_file, sender_column):
    """
    Function to extract unique sender addresses from a large CSV file of transaction data using Dask.

    Parameters:
    input_file (str): Path to the input CSV file representing transaction data.
    output_file (str): Path to the output CSV file where unique sender addresses will be saved.
    sender_column (str): The name of the column in the input file that contains the sender's addresses.

    Returns:
    None
    """

    # Read in data using Dask's read_csv function
    # Dask is a parallel computing library that allows us to work with large datasets
    # The read_csv function works similarly to pandas' read_csv, but it performs the operations lazily
    df = dd.read_csv(input_file)

    # Get unique sender's addresses
    # drop_duplicates returns the unique values in the sender_column
    unique_senders = df[sender_column].drop_duplicates()

    # Compute the result and save to a new CSV file
    # compute() performs the actual computation and returns a pandas Series
    # to_csv writes the Series, with the column name as the header, to a CSV file
    unique_senders.compute().to_csv(output_file, index=False)

In [None]:
# Usage (the output goes to ../data/ so that the next cell can read it from there)
extract_unique_senders('../data/eth_transaction_data.csv', '../data/unique_senders.csv', 'Sender\'s Address')
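
If Dask is not available, a similar result can be obtained with the standard library by streaming the file row by row and keeping a set of the addresses seen so far. This is an alternative sketch, not the method used above; it assumes the unique addresses fit in memory, and `extract_unique_senders_stream` is a hypothetical name.

```python
import csv
import os
import tempfile


def extract_unique_senders_stream(input_file, output_file, sender_column):
    """Write the unique values of `sender_column` to `output_file` in first-seen order.

    Returns the number of unique values written.
    """
    seen = set()
    with open(input_file, newline="") as src, open(output_file, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow([sender_column])  # Keep the original column name as the header
        for row in reader:
            address = row[sender_column]
            if address not in seen:
                seen.add(address)
                writer.writerow([address])
    return len(seen)


# Demonstration on a tiny temporary file with made-up addresses
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "tx.csv")
    out_path = os.path.join(tmp, "unique.csv")
    with open(src_path, "w", newline="") as f:
        f.write("Sender's Address,value\n0xaaa,1\n0xbbb,2\n0xaaa,3\n")
    n_unique = extract_unique_senders_stream(src_path, out_path, "Sender's Address")
    with open(out_path, newline="") as f:
        unique_lines = f.read().splitlines()

print(n_unique, unique_lines)
```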

In [None]:
# Directory where your files are stored
data_dir = "../data/"

# Read the CSV file into a DataFrame
# (pandas loads the file fully into memory, so it is safe to overwrite it afterwards;
# a lazy Dask read could still be reading the file while it is being overwritten)
df = pd.read_csv(f"{data_dir}unique_senders.csv")
# Format column names: lowercase, spaces replaced with '_', single quotes removed.
df = df.rename(columns={col: col.lower().replace(' ', '_').replace("'", "") for col in df.columns})
# Overwrite the original CSV file
df.to_csv(f"{data_dir}unique_senders.csv", index=False)
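
The renaming rule in the cell above can be checked on its own. The sketch below applies the same three string operations to the column name used earlier; `normalize_column` is a hypothetical helper name.

```python
def normalize_column(name):
    """Lowercase a column name, replace spaces with underscores, drop single quotes."""
    return name.lower().replace(" ", "_").replace("'", "")


print(normalize_column("Sender's Address"))  # senders_address
```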