# Ethereum mempool Data Collection

[Jochen Hoenicke](https://jochen-hoenicke.de/) started collecting data in December 2020. Please read his own words below:

> I started collecting around December 2020 more than half a
> year before EIP-1559. You can see it visually by using the all range
> and then zooming in.
> You can also download the raw data at
> https://johoe.jochen-hoenicke.de/queue/ethereum/mempool.log

> The data itself is in jsonp, with one line per minute. Each line
> starts with timestamp (unix time: seconds since 1970) followed by
> three arrays for the three charts (in the order txcount, txsize, fee).
> Each array contains one number for each fee range, the ranges are
> given in the config in mempool.js.

> The file https://mempool.jhoenicke.de/mempool.js starts with a config
> array, with one entry for every server (BTC, BCH, LTC, etc). In this
> you can find the fee ranges configured for ETH.

> I think I also had a backup server running at that time, with data at
> https://jochen-hoenicke.de/queue/eth/mempool.log
> The data should now be identical, but I think at that time it was
> generated from a different server.
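
The line format described in the quote can be parsed directly with the standard `json` module. The following is a minimal sketch, assuming each line is a JSON array of the form `[timestamp, [txcount...], [txsize...], [fee...]]`, optionally followed by a trailing comma. The helper name `parse_line` and the sample values are made up for illustration and are not taken from the actual log.

```python
import json


def parse_line(line):
    """Parse one log line into (timestamp, txcount, txsize, fee).

    Assumption: the line is a JSON array, optionally followed by a comma.
    """
    text = line.strip().rstrip(",")
    timestamp, txcount, txsize, fee = json.loads(text)
    return int(timestamp), txcount, txsize, fee


# Made-up sample line with three fee ranges (not real data)
sample = "[1609459200,[5,3,1],[550,330,110],[1000,900,400]],"
ts, txcount, txsize, fee = parse_line(sample)
print(ts, txcount, txsize, fee)
```

If the real file wraps the lines in a jsonp callback, the wrapper lines would need to be skipped before calling a helper like this.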

In [1]:
# Import libraries
import requests
import pandas as pd
import json
import csv
import ast
import datetime

## Download mempool Data

In [None]:
# Define the URL
url = "https://johoe.jochen-hoenicke.de/queue/ethereum/mempool.log"

# Send a GET request to the URL and store the response object
response = requests.get(url, timeout=60)

# Check if the request was successful
if response.status_code == 200:
    # Get the raw data as bytes
    data = response.content
    # Write the data to a file
    filename = "../data/mempool.csv"
    with open(filename, "wb") as f:
        f.write(data)
    # Print a success message
    print("Data downloaded successfully!")
else:
    # Print an error message
    print(f"Request failed with status code {response.status_code}")

## Download mempool Backup Data

In [None]:
# Define the URL
url = "https://jochen-hoenicke.de/queue/eth/mempool.log" # Backup server

# Send a GET request to the URL and store the response object
response = requests.get(url, timeout=60)

# Check if the request was successful
if response.status_code == 200:
    # Get the raw data as bytes
    data = response.content
    # Write the data to a file
    filename = "../data/mempool_backup.csv"
    with open(filename, "wb") as f:
        f.write(data)
    # Print a success message
    print("Data downloaded successfully!")
else:
    # Print an error message
    print(f"Request failed with status code {response.status_code}")
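
The two download cells differ only in URL and file name, so they could share one helper. The following is a minimal sketch, not the notebook's own method: it swaps `requests` for the standard library's `urllib` so that it has no third-party dependency, and it streams the response to disk instead of holding it in memory. The name `download_file` and the 60-second timeout are assumptions chosen for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, dest, timeout=60):
    """Stream the resource at `url` to `dest` and return the destination path."""
    dest = Path(dest)
    # Create the target directory if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises urllib.error.HTTPError for HTTP error responses (4xx/5xx)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as f:
        shutil.copyfileobj(response, f)
    return dest


# Example calls (commented out so the sketch does not hit the network):
# download_file("https://johoe.jochen-hoenicke.de/queue/ethereum/mempool.log", "../data/mempool.csv")
# download_file("https://jochen-hoenicke.de/queue/eth/mempool.log", "../data/mempool_backup.csv")
```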

## Preprocess the mempool data

In [None]:
# Load the data
df = pd.read_csv("../data/mempool.csv", header=None)

# Remove the last column
df.drop(df.columns[-1], axis=1, inplace=True)

# Clean and process the UNIX timestamp
df[0] = df[0].str.strip("'[]")
df[0] = df[0].astype(int)

# Create a new column with datetime values (fromtimestamp() uses the local time zone)
df.insert(1, 'date_time', df[0].apply(lambda x: datetime.datetime.fromtimestamp(x).strftime('%Y-%m-%d %H:%M:%S')))

# Clean and process txcount, txsize, and fees
for i in range(2, 5):
    df[i] = df[i].apply(lambda x: [float(j.strip("]")) for j in str(x).strip('[]').split(',') if j.strip() != ''])

# Derive the number of fee ranges per chart (each row holds three arrays of equal length)
n = (df.shape[1] - 2) // 3

timestamp_columns = ['timestamp', 'date_time']
txcount_columns = ['txcount_' + str(i+1) for i in range(n)]
txsize_columns = ['txsize_' + str(i+1) for i in range(n)]
fee_columns = ['fee_' + str(i+1) for i in range(n)]
column_names = timestamp_columns + txcount_columns + txsize_columns + fee_columns

# Renaming columns
df.columns = column_names

# Save the preprocessed data
df.to_csv("../data/cleaned_mempool.csv", index=False)
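
The column layout used above (`timestamp`, `date_time`, then `txcount_1..n`, `txsize_1..n`, `fee_1..n`) can also be built from arrays that have already been parsed. The following is a minimal sketch, assuming the three arrays have equal length. `flatten_record` is a hypothetical helper, the sample values are invented, and the timestamp is rendered in UTC rather than in local time.

```python
from datetime import datetime, timezone


def flatten_record(timestamp, txcount, txsize, fee):
    """Return one flat dict with the same column names as the cleaned CSV."""
    if not (len(txcount) == len(txsize) == len(fee)):
        raise ValueError("txcount, txsize and fee must have the same length")
    row = {
        "timestamp": timestamp,
        # Render the UNIX timestamp in UTC so the result does not depend on the machine
        "date_time": datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
    }
    for name, values in (("txcount", txcount), ("txsize", txsize), ("fee", fee)):
        for i, value in enumerate(values, start=1):
            row[f"{name}_{i}"] = value
    return row


# Invented values with two fee ranges (not real data)
row = flatten_record(1609459200, [5, 3], [550, 330], [1000, 900])
print(list(row))
```

A list of such dicts can be passed to `pd.DataFrame` to obtain a table with named columns.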


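## Final Processing of the mempool Data
Square brackets still remain after preprocessing. `read_csv` splits each line on every comma, so the opening and closing brackets of each array stay attached to its first and last values, and the list conversion above writes columns 2 to 4 back as bracketed lists. The script below removes all remaining brackets.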

In [None]:
def clean_tx(tx):
    tx = tx.replace('[', '')
    tx = tx.replace(']', '')
    return tx

def clean_csv(input_file_path, output_file_path):
    with open(input_file_path, 'r') as input_file, open(output_file_path, 'w', newline='') as output_file:
        reader = csv.reader(input_file)
        writer = csv.writer(output_file)
        for row in reader:
            cleaned_row = [clean_tx(cell) for cell in row]
            writer.writerow(cleaned_row)

In [None]:
# Final process
clean_csv("../data/cleaned_mempool.csv", "../data/final_mempool.csv")
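
After the final step it can be worth confirming that the output has the expected shape. The following is a minimal sketch of such a check, assuming the file has a header row and `2 + 3n` columns as produced above. `check_final_csv` is a hypothetical helper and is not part of the original pipeline.

```python
import csv


def check_final_csv(path):
    """Return the number of fee ranges n, or raise ValueError if the file looks wrong."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Two timestamp columns plus three equally sized groups of value columns
        if (len(header) - 2) % 3 != 0:
            raise ValueError(f"unexpected column count: {len(header)}")
        for row_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                raise ValueError(f"row {row_no} has {len(row)} columns")
            if any("[" in cell or "]" in cell for cell in row):
                raise ValueError(f"row {row_no} still contains brackets")
    return (len(header) - 2) // 3


# Example call (commented out because it needs the generated file):
# n = check_final_csv("../data/final_mempool.csv")
```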

## mempool.js for config

As described above, each line of the `mempool` data holds a timestamp (unix time: seconds since 1970) followed by three arrays for the three charts (in the order txcount, txsize, fee), with one number per fee range.

The fee ranges are defined in the file https://mempool.jhoenicke.de/mempool.js, which starts with a config array containing one entry for every server (BTC, BCH, LTC, etc.). The fee ranges configured for ETH can be found in that array.

In [2]:
import requests

url = "https://mempool.jhoenicke.de/mempool.js"
response = requests.get(url, timeout=60)

# Make sure the request was successful
if response.status_code == 200:
    # Open the file in write mode
    with open('../data/mempool.js', 'w', encoding='utf-8') as f:
        # Write the contents of the response to the file
        f.write(response.text)
else:
    print(f"Failed to download file. Status code: {response.status_code}")
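
Once the fee ranges for ETH have been read from the config, they can be attached to the numbered columns. The sketch below assumes the ranges are a sorted list of lower bounds, one per column, with the last range open-ended. That layout, the helper `range_labels`, and the example bounds are assumptions for illustration and should be checked against the actual `mempool.js`.

```python
def range_labels(lower_bounds):
    """Turn a sorted list of lower bounds into labels such as '0-1' and '10+'."""
    labels = []
    for i, low in enumerate(lower_bounds):
        if i + 1 < len(lower_bounds):
            labels.append(f"{low}-{lower_bounds[i + 1]}")
        else:
            # Assumption: the last range has no upper bound
            labels.append(f"{low}+")
    return labels


# Hypothetical bounds, not the real ETH configuration
bounds = [0, 1, 2, 5, 10]
mapping = {f"txcount_{i}": label for i, label in enumerate(range_labels(bounds), start=1)}
print(mapping)
```

The same labels would apply to the `txsize_*` and `fee_*` columns, since all three arrays share the fee ranges.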