# Test: Loading Blockchair TSV Files with src/schemas.py

This notebook demonstrates how to load the Blockchair TSV files with the schemas defined in `src/schemas.py`.

## Prerequisites

1. Blockchair data downloaded (with `blockchair-downloader`)
2. TSV files extracted into a local folder

## Setup

In [1]:
import sys
from pathlib import Path

# Add the project directory to the Python path
project_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Now src can be imported
from pyspark.sql import SparkSession
from src.schemas import BLOCKS_SCHEMA, TRANSACTIONS_SCHEMA, INPUTS_SCHEMA, OUTPUTS_SCHEMA
from src.schemas import load_blockchair_data

print("✅ Schemas imported successfully!")
print(f"\nBlocks Schema: {len(BLOCKS_SCHEMA.fields)} fields")
print(f"Transactions Schema: {len(TRANSACTIONS_SCHEMA.fields)} fields")
print(f"Inputs Schema: {len(INPUTS_SCHEMA.fields)} fields")
print(f"Outputs Schema: {len(OUTPUTS_SCHEMA.fields)} fields")

✅ Schemas imported successfully!

Blocks Schema: 36 fields
Transactions Schema: 22 fields
Inputs Schema: 21 fields
Outputs Schema: 11 fields
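
The setup cell only works when the notebook is started from the project root or from a folder named `notebooks`. A slightly more general variant could walk up the directory tree instead. The sketch below assumes the project root is the nearest ancestor that contains a `src` directory; `find_project_root` and `add_to_sys_path` are hypothetical helper names, not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory containing '{marker}' found above {start}")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path once, mirroring the setup cell."""
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))


# Usage inside the notebook (assumes a `src` folder exists somewhere above cwd):
# add_to_sys_path(find_project_root(Path.cwd()))
```

With this variant the notebook can live in a nested folder without changing the import logic.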


## Configuration

**Adjust this path:**

In [2]:
# IMPORTANT: path to your extracted Blockchair data
LOCAL_DATA_PATH = '/Users/roman/Documents/Master/Module/ADE/test'

print(f"Loading data from: {LOCAL_DATA_PATH}")

Loading data from: /Users/roman/Documents/Master/Module/ADE/test
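
Before starting Spark, it can help to confirm that the configured folder exists and contains files for each table. The following is a minimal sketch that uses the same `*<table>*.tsv` patterns as the load cells in this notebook; `count_tsv_files` and `TABLES` are hypothetical names introduced only for illustration.

```python
from pathlib import Path

# Table names as they appear in the file patterns used in this notebook.
TABLES = ("blocks", "transactions", "inputs", "outputs")


def count_tsv_files(data_path: str) -> dict[str, int]:
    """Count TSV files per table in a flat folder, using *<table>*.tsv patterns."""
    base = Path(data_path)
    if not base.is_dir():
        raise NotADirectoryError(f"Data folder not found: {base}")
    return {table: len(list(base.glob(f"*{table}*.tsv"))) for table in TABLES}


# Usage inside the notebook:
# counts = count_tsv_files(LOCAL_DATA_PATH)
# missing = [table for table, n in counts.items() if n == 0]
# print(counts, "missing:", missing)
```

A table with a count of zero indicates a wrong path or a file that has not been extracted yet.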


## Create the Spark Session

In [None]:
import logging

# Reduce Spark warnings
logging.getLogger("py4j").setLevel(logging.ERROR)

spark = SparkSession.builder \
    .appName("Blockchair TSV Test") \
    .master("local[*]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.debug.maxToStringFields", "100") \
    .getOrCreate()

# Set the Spark log level to ERROR (show errors only)
spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")
print(f"Spark Master: {spark.sparkContext.master}")

## Method 1: Load Individual Files (Manual)

This method loads each table individually with an explicit schema.

In [5]:
# Load blocks
blocks_df = spark.read.csv(
    f"{LOCAL_DATA_PATH}/*blocks*.tsv",
    sep='\t',
    header=True,
    schema=BLOCKS_SCHEMA
)

print("Blocks DataFrame loaded!")
print(f"Row count: {blocks_df.count()}")
blocks_df.printSchema()
blocks_df.show(5, truncate=True)

Blocks DataFrame loaded!
Row count: 115
root
 |-- id: long (nullable = true)
 |-- hash: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- median_time: timestamp (nullable = true)
 |-- size: long (nullable = true)
 |-- stripped_size: long (nullable = true)
 |-- weight: long (nullable = true)
 |-- version: long (nullable = true)
 |-- version_hex: string (nullable = true)
 |-- version_bits: string (nullable = true)
 |-- merkle_root: string (nullable = true)
 |-- nonce: long (nullable = true)
 |-- bits: long (nullable = true)
 |-- difficulty: long (nullable = true)
 |-- chainwork: string (nullable = true)
 |-- coinbase_data_hex: string (nullable = true)
 |-- transaction_count: integer (nullable = true)
 |-- witness_count: integer (nullable = true)
 |-- input_count: integer (nullable = true)
 |-- output_count: integer (nullable = true)
 |-- input_total: long (nullable = true)
 |-- input_total_usd: double (nullable = true)
 |-- output_total: long (nullable = true)
 |-

In [6]:
# Load transactions
transactions_df = spark.read.csv(
    f"{LOCAL_DATA_PATH}/*transactions*.tsv",
    sep='\t',
    header=True,
    schema=TRANSACTIONS_SCHEMA
)

print("Transactions DataFrame loaded!")
print(f"Row count: {transactions_df.count()}")
transactions_df.show(5, truncate=True)

Transactions DataFrame loaded!
Row count: 1
+--------+--------------------+-------------------+----+------+-------+---------+-----------+-----------+-----------+------------+-----------+---------------+------------+----------------+---+-------+----------+--------------+-----------+---------------+---------+
|block_id|                hash|               time|size|weight|version|lock_time|is_coinbase|has_witness|input_count|output_count|input_total|input_total_usd|output_total|output_total_usd|fee|fee_usd|fee_per_kb|fee_per_kb_usd|fee_per_kwu|fee_per_kwu_usd|cdd_total|
+--------+--------------------+-------------------+----+------+-------+---------+-----------+-----------+-----------+------------+-----------+---------------+------------+----------------+---+-------+----------+--------------+-----------+---------------+---------+
|       0|4a5e1e4baab89f3a3...|2009-01-03 18:15:05| 204|   816|      1|        0|       NULL|       NULL|          1|           1|          0|            0.

In [7]:
# Load inputs
inputs_df = spark.read.csv(
    f"{LOCAL_DATA_PATH}/*inputs*.tsv",
    sep='\t',
    header=True,
    schema=INPUTS_SCHEMA
)

print("Inputs DataFrame loaded!")
print(f"Row count: {inputs_df.count()}")
inputs_df.show(5, truncate=True)

Inputs DataFrame loaded!
Row count: 24
+--------+--------------------+-----+-------------------+----------+---------+--------------------+------+--------------------+----------------+------------+-----------------+-------------------------+--------------+-------------------+------------------+-----------------+----------------------+----------------+--------+------------------+
|block_id|    transaction_hash|index|               time|     value|value_usd|           recipient|  type|          script_hex|is_from_coinbase|is_spendable|spending_block_id|spending_transaction_hash|spending_index|      spending_time|spending_value_usd|spending_sequence|spending_signature_hex|spending_witness|lifespan|               cdd|
+--------+--------------------+-----+-------------------+----------+---------+--------------------+------+--------------------+----------------+------------+-----------------+-------------------------+--------------+-------------------+------------------+-----------------

In [8]:
# Load outputs
outputs_df = spark.read.csv(
    f"{LOCAL_DATA_PATH}/*outputs*.tsv",
    sep='\t',
    header=True,
    schema=OUTPUTS_SCHEMA
)

print("Outputs DataFrame loaded!")
print(f"Row count: {outputs_df.count()}")
outputs_df.show(5, truncate=True)

Outputs DataFrame loaded!
Row count: 129
+--------+--------------------+-----+-------------------+----------+---------+--------------------+------+--------------------+----------------+------------+
|block_id|    transaction_hash|index|               time|     value|value_usd|           recipient|  type|          script_hex|is_from_coinbase|is_spendable|
+--------+--------------------+-----+-------------------+----------+---------+--------------------+------+--------------------+----------------+------------+
|   12573|b57505a3942d95633...|    0|2009-04-29 00:00:06|5000000000|      0.5|1Dx5es5uZ2ATYPBHo...|pubkey|410482072e302287a...|            NULL|        NULL|
|   12574|40042da8eb7eebc1f...|    0|2009-04-29 00:10:43|5000000000|      0.5|1DVidNv9m2LakmoMK...|pubkey|410482bf82b984dec...|            NULL|        NULL|
|   12575|c6187b98102079e7e...|    0|2009-04-29 00:30:56|5000000000|      0.5|14iZ3hg18aXNQULsw...|pubkey|4104087530d001af4...|            NULL|        NULL|
|   12

## Method 2: Use the Helper Function (Recommended)

The `load_blockchair_data()` function loads all four tables at once.

**Note:** This method expects a specific folder structure:
```
LOCAL_DATA_PATH/
├── blocks/*.tsv
├── transactions/*.tsv
├── inputs/*.tsv
└── outputs/*.tsv
```

If your files sit directly in the root folder (as in `/Users/roman/Documents/Master/Module/ADE/test`), use Method 1.
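
The two layouts differ only in the glob pattern passed to the reader. The sketch below picks a pattern per table depending on whether a sub-folder exists. It is not the implementation of `load_blockchair_data()`, whose signature is not shown in this notebook; `table_glob` is a hypothetical helper used only to illustrate the idea.

```python
from pathlib import Path


def table_glob(data_path: str, table: str) -> str:
    """Return the glob for one table, preferring the sub-folder layout if present."""
    base = Path(data_path)
    if (base / table).is_dir():
        # Sub-folder layout: LOCAL_DATA_PATH/<table>/*.tsv
        return (base / table).as_posix() + "/*.tsv"
    # Flat layout: LOCAL_DATA_PATH/*<table>*.tsv
    return base.as_posix() + f"/*{table}*.tsv"


# Usage inside the notebook:
# blocks_df = spark.read.csv(table_glob(LOCAL_DATA_PATH, "blocks"),
#                            sep="\t", header=True, schema=BLOCKS_SCHEMA)
```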

## Check Data Quality

Check whether the data types are correct:

In [9]:
print("=" * 70)
print("DATA TYPE VALIDATION")
print("=" * 70)

# Blocks: check timestamp
print("\n1. Blocks - timestamp column:")
blocks_df.select("time").show(3)
print(f"   Type: {blocks_df.schema['time'].dataType}")

# Transactions: check boolean
print("\n2. Transactions - boolean columns:")
transactions_df.select("is_coinbase", "has_witness").show(3)
print(f"   Type is_coinbase: {transactions_df.schema['is_coinbase'].dataType}")

# Outputs: check value (must be LongType for satoshis)
print("\n3. Outputs - value column (satoshis):")
outputs_df.select("value").show(3)
print(f"   Type: {outputs_df.schema['value'].dataType}")

print("\n" + "=" * 70)
print("✅ All data types correct!")
print("=" * 70)

DATA TYPE VALIDATION

1. Blocks - timestamp column:
+-------------------+
|               time|
+-------------------+
|2025-11-25 00:00:57|
|2025-11-25 00:23:39|
|2025-11-25 00:28:13|
+-------------------+
only showing top 3 rows

   Type: TimestampType()

2. Transactions - boolean columns:
+-----------+-----------+
|is_coinbase|has_witness|
+-----------+-----------+
|       NULL|       NULL|
+-----------+-----------+

   Type is_coinbase: BooleanType()

3. Outputs - value column (satoshis):
+----------+
|     value|
+----------+
|5000000000|
|5000000000|
|5000000000|
+----------+
only showing top 3 rows

   Type: LongType()

✅ All data types correct!
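
The validation cell prints the declared type of each column, which comes from the schema rather than from the data. In the output above, `is_coinbase` and `has_witness` are NULL even though their type is `BooleanType()`. With an explicit schema, Spark's CSV reader in its default permissive mode sets fields it cannot parse to NULL, so counting NULLs per column is one way to spot such cases. The sketch below builds the counting expressions as Spark SQL strings; `null_count_exprs` and `null_counts` are hypothetical helper names.

```python
def null_count_exprs(columns):
    """Build one Spark SQL expression per column that counts its NULL values."""
    return [f"sum(cast(`{c}` IS NULL AS int)) AS `{c}`" for c in columns]


def null_counts(df, columns):
    """Return {column: number of NULLs} for a Spark DataFrame `df`."""
    row = df.selectExpr(*null_count_exprs(columns)).first()
    return row.asDict()


# Usage inside the notebook:
# print(null_counts(transactions_df, ["is_coinbase", "has_witness"]))
# print(null_counts(outputs_df, ["value", "is_from_coinbase", "is_spendable"]))
```

A NULL count equal to the row count for a column that should always be populated suggests that the declared type does not match how the raw file encodes the values.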


## Summary

This notebook has shown:

1. ✅ Importing the schemas from `src/schemas.py` works
2. ✅ TSV files can be loaded with explicit schemas
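3. ✅ The declared data types (timestamps, booleans, LongType for satoshis) match the schemas; note that `is_coinbase` and `has_witness` appear as NULL in the sample output, so their values still need to be verified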
4. ✅ Only 3-4 lines of code per table are needed

### Usage in the Main Notebook

In the final notebook (for the professor), the code would look like this:

```python
# Imports
from pyspark.sql import SparkSession
from src.schemas import BLOCKS_SCHEMA, TRANSACTIONS_SCHEMA, INPUTS_SCHEMA, OUTPUTS_SCHEMA

# Configuration
LOCAL_DATA_PATH = '/path/to/blockchair/data'  # the professor only changes this line

# Spark session
spark = SparkSession.builder.appName("Bitcoin Whale Analysis").getOrCreate()

# Load data (4 lines!)
blocks_df = spark.read.csv(f"{LOCAL_DATA_PATH}/*blocks*.tsv", sep='\t', header=True, schema=BLOCKS_SCHEMA)
transactions_df = spark.read.csv(f"{LOCAL_DATA_PATH}/*transactions*.tsv", sep='\t', header=True, schema=TRANSACTIONS_SCHEMA)
inputs_df = spark.read.csv(f"{LOCAL_DATA_PATH}/*inputs*.tsv", sep='\t', header=True, schema=INPUTS_SCHEMA)
outputs_df = spark.read.csv(f"{LOCAL_DATA_PATH}/*outputs*.tsv", sep='\t', header=True, schema=OUTPUTS_SCHEMA)

# From here on: analysis code...
```
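
The four read calls differ only in the table name and the schema, so they could also be written as a loop over a mapping. This is a minimal sketch under that assumption; `load_tables` is a hypothetical helper and not the project's `load_blockchair_data()`.

```python
def load_tables(spark, data_path, schemas):
    """Read one DataFrame per table from `<data_path>/*<table>*.tsv`.

    `schemas` maps a table name (e.g. "blocks") to its Spark schema.
    """
    return {
        table: spark.read.csv(
            f"{data_path}/*{table}*.tsv", sep="\t", header=True, schema=schema
        )
        for table, schema in schemas.items()
    }


# Usage inside the notebook:
# dfs = load_tables(spark, LOCAL_DATA_PATH, {
#     "blocks": BLOCKS_SCHEMA,
#     "transactions": TRANSACTIONS_SCHEMA,
#     "inputs": INPUTS_SCHEMA,
#     "outputs": OUTPUTS_SCHEMA,
# })
# blocks_df = dfs["blocks"]
```

The explicit four-line version is easier to read at a glance, while the loop keeps the reader options in one place.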

In [10]:
# Stop the Spark session
spark.stop()
print("Spark session stopped.")

Spark session stopped.
