# VDEH Data Fusion Pipeline

**Focus:** AI-assisted fusion of VDEH and DNB data, plus dual-source language fusion

## 🎯 Goal
- Intelligent fusion of the original VDEH records with DNB data
- Conflict resolution via an Ollama LLM
- **Dual-Source Language Fusion**: MARC21 language code + langdetect detection
- Full traceability of every decision
- Quality improvement through data enrichment

## 📚 Input/Output
- **Input**: `data/vdeh/processed/04_dnb_enriched_data.parquet`
- **Output**: `data/vdeh/processed/05_fused_data.parquet`

## 🤖 AI Model
- **Ollama**: Local LLM (llama3.3:70b)
- **API**: http://localhost:11434
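
For orientation, a request to this endpoint could be assembled as shown below. This is a minimal sketch and not the notebook's `OllamaClient`: the helper `build_generate_request` and the example prompt are hypothetical. The `/api/generate` path matches the URL configured later in this notebook, and the `model`, `prompt`, and `stream` fields are assumed to follow Ollama's REST API.

```python
import json
import urllib.request

# Same endpoint the notebook passes to OllamaClient further down.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.3:70b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request. Hypothetical helper."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request("Which title is correct: 'Stahl und Eisen' or 'Stahl u. Eisen'?")
print(req.full_url, req.get_method())

# Sending the request needs a running `ollama serve`, so it is left commented out:
# with urllib.request.urlopen(req, timeout=220) as resp:
#     answer = json.loads(resp.read())["response"]
```

Building the request separately from sending it keeps the payload easy to inspect when a model call times out.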

## 🔄 Fusion Architecture

**Three fusion strategies:**
1. **No DNB data** → keep VDEH
2. **No conflicts** → simple merge (VDEH takes priority, DNB fills gaps)
3. **Conflicts present** → AI decision via Ollama
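
The three-way dispatch above can be sketched as follows. This is an illustration only: `choose_strategy`, `simple_merge`, the dict-based record format, and the strategy labels are hypothetical and are not taken from the `FusionEngine` implementation, whose internals this notebook does not show. The sketch also assumes missing values are `None` or empty strings, and it treats values that differ only in case or surrounding whitespace as equal.

```python
from typing import Optional

FIELDS = ("title", "authors", "year", "publisher")


def _is_empty(value) -> bool:
    """Treat None and blank strings as missing (assumption: no NaN values)."""
    return value is None or str(value).strip() == ""


def choose_strategy(vdeh: dict, dnb: Optional[dict]) -> str:
    """Return which of the three fusion strategies applies to a record pair."""
    # Strategy 1: nothing usable from DNB, so the VDEH record is kept unchanged.
    if dnb is None or all(_is_empty(dnb.get(f)) for f in FIELDS):
        return "keep_vdeh"
    # A conflict exists when both sources have a value and the values differ.
    conflicts = [
        f for f in FIELDS
        if not _is_empty(vdeh.get(f)) and not _is_empty(dnb.get(f))
        and str(vdeh[f]).strip().lower() != str(dnb[f]).strip().lower()
    ]
    # Strategy 3 if any conflict was found, otherwise strategy 2.
    return "ai_decision" if conflicts else "simple_merge"


def simple_merge(vdeh: dict, dnb: dict) -> dict:
    """Strategy 2: VDEH takes priority, DNB only fills empty fields."""
    return {f: (dnb.get(f) if _is_empty(vdeh.get(f)) else vdeh.get(f)) for f in FIELDS}


print(choose_strategy({"title": "Stahlbau"}, {"title": "Stahlbau", "year": 1990}))
```

Only the third branch would need a model call, which is why the cheap checks come first.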

**Full traceability:**
- `fusion_*_source`: which source supplied each field
- `fusion_conflicts`: JSON with all detected conflicts
- `fusion_ai_reasoning`: the AI's rationale for its decision
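
These columns make it possible to audit individual decisions after a run. The snippet below is a sketch on hand-made records: every sample value (including the source labels `'vdeh'` and `'dnb'`) and the helper `conflict_fields` are invented for illustration. It also assumes that `fusion_conflicts` holds a JSON object keyed by field name, which the real output may not.

```python
import json
from collections import Counter

# Hand-made stand-ins for fused records (all values are invented).
records = [
    {"title": "Stahlbau", "fusion_title_source": "vdeh",
     "fusion_conflicts": None, "fusion_ai_reasoning": None},
    {"title": "Werkstoffkunde", "fusion_title_source": "dnb",
     "fusion_conflicts": json.dumps({"title": {"vdeh": "Werkstofkunde", "dnb": "Werkstoffkunde"}}),
     "fusion_ai_reasoning": "DNB spelling matches the title page."},
]


def conflict_fields(raw):
    """Return the sorted field names recorded in a fusion_conflicts JSON string."""
    return sorted(json.loads(raw)) if raw else []


# How often each source supplied the title.
source_counts = Counter(r["fusion_title_source"] for r in records)

# Records with at least one conflict, paired with the AI's stated reason.
audit = [
    (r["title"], conflict_fields(r["fusion_conflicts"]), r["fusion_ai_reasoning"])
    for r in records if conflict_fields(r["fusion_conflicts"])
]
print(dict(source_counts))
print(audit)
```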

In [1]:
# 🛠️ SETUP AND DATA LOADING
import sys
from pathlib import Path
import pandas as pd
import json

# Add src to path (temporary until utils is imported)
project_root = Path.cwd()
while not (project_root / 'config.yaml').exists() and project_root.parent != project_root:
    project_root = project_root.parent
sys.path.insert(0, str(project_root / 'src'))

# Now use the utility function
from utils.notebook_utils import setup_notebook

project_root, config = setup_notebook()
print(f"✅ Project root: {project_root}")
print(f"✅ Project: {config.get('project.name')} v{config.get('project.version')}")

2025-12-25 22:02:29 - utils.notebook_utils - INFO - Searching for project root...
2025-12-25 22:02:29 - utils.notebook_utils - INFO - Project root found: /media/sz/Data/Bibo/analysis
2025-12-25 22:02:29 - utils.notebook_utils - INFO - Loading configuration...
2025-12-25 22:02:29 - config_loader - INFO - Configuration loaded from /media/sz/Data/Bibo/analysis/config.yaml
2025-12-25 22:02:29 - utils.notebook_utils - INFO - Configuration loaded successfully: Dual-Source Bibliothek Bestandsvergleich
✅ Project root: /media/sz/Data/Bibo/analysis
✅ Project: Dual-Source Bibliothek Bestandsvergleich v2.2.0


In [2]:
# 📂 LOAD DNB-ENRICHED DATA
processed_dir = config.project_root / config.get('paths.data.vdeh.processed')
input_path = processed_dir / '04_dnb_enriched_data.parquet'
metadata_path = processed_dir / '04_metadata.json'

if not input_path.exists():
    raise FileNotFoundError(f"Input file not found: {input_path}\n"
                            "Please run 04_vdeh_data_enrichment.ipynb first.")

# Load data
df_enriched = pd.read_parquet(input_path)

# Load metadata from the previous stage
with open(metadata_path, 'r', encoding='utf-8') as f:
    prev_metadata = json.load(f)

print(f"📂 Data loaded from: {input_path}")
print(f"📊 Records: {len(df_enriched):,}")
print(f"💾 Memory: {df_enriched.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# DNB data statistics
if 'dnb_query_method' in df_enriched.columns:
    dnb_records = df_enriched['dnb_query_method'].notna().sum()
    print(f"\n📊 DNB data available: {dnb_records:,} ({dnb_records/len(df_enriched)*100:.1f}%)")

    method_counts = df_enriched['dnb_query_method'].value_counts()
    for method, count in method_counts.items():
        print(f"   {method}: {count:,}")

📂 Data loaded from: /media/sz/Data/Bibo/analysis/data/vdeh/processed/04_dnb_enriched_data.parquet
📊 Records: 58,305
💾 Memory: 80.9 MB

📊 DNB data available: 5,855 (10.0%)
   ISBN: 5,855


In [None]:
# 📋 FUSION SETUP
from fusion import OllamaClient, FusionEngine

print("📋 === FUSION SETUP ===\n")

# Initialize the Ollama client
ollama_client = OllamaClient(
    api_url="http://localhost:11434/api/generate",
    model="llama3.3:70b",
    timeout_sec=220,
    max_retries=4,
    retry_backoff_base_sec=2,
    abort_on_timeout=True,
    enable_fallback=True,
    fallback_model="llama3.2"
)

# Test connection
if ollama_client.test_connection():
    print(f"✅ Ollama connected: {ollama_client.model}")
else:
    raise RuntimeError("❌ Ollama is not reachable! Make sure Ollama is running: ollama serve")

# Initialize the fusion engine
fusion_engine = FusionEngine(
    ollama_client=ollama_client,
    variant_priority=["id", "title_author"]
)

print(f"⚙️  Timeout: {ollama_client.timeout_sec}s | Retries: {ollama_client.max_retries}")
print(f"🤖 Active model: {ollama_client.model}\n")

In [None]:
# 🚀 RUN FUSION
from tqdm.auto import tqdm
from fusion import OllamaUnavailableError

print("🚀 === RUN FUSION ===\n")

# Configuration
RESET_FUSION = False  # Set to True to reset all fusion results
SAVE_INTERVAL = 50    # Save progress every N records

# Optional limit for testing
FUSION_LIMIT = None
try:
    FUSION_LIMIT = int(config.get('debug.fusion_limit', 0))
    if FUSION_LIMIT <= 0:
        FUSION_LIMIT = None
except Exception:
    FUSION_LIMIT = None

# Progress tracking files
progress_file = processed_dir / '05_fused_data_progress.parquet'
retry_queue_file = processed_dir / '05_fused_retry_queue.json'

# Reset if requested
if RESET_FUSION:
    if progress_file.exists():
        progress_file.unlink()
    if retry_queue_file.exists():
        retry_queue_file.unlink()
    print("🗑️ Fusion results reset\n")

# IMPORTANT: the progress file takes precedence over the input file.
# When resuming, we load the complete intermediate state.
if not RESET_FUSION and progress_file.exists():
    print("📂 Loading progress from progress file...")
    df_enriched = pd.read_parquet(progress_file)

    if not df_enriched.index.is_unique:
        df_enriched = df_enriched[~df_enriched.index.duplicated(keep='last')]

    # Determine which records have already been fused
    if 'fusion_title_source' in df_enriched.columns:
        already_fused = set(df_enriched[df_enriched['fusion_title_source'].notna()].index)
    else:
        already_fused = set()

    print(f"   ✅ {len(already_fused):,} records already fused")
    print("   ✅ Full dataset restored from progress file")
else:
    already_fused = set()

# Statistics BEFORE fusion
print("\n📊 Completeness BEFORE fusion:")
before_stats = {
    'title': df_enriched['title'].notna().sum(),
    'authors': (df_enriched['authors_str'].notna() & (df_enriched['authors_str'] != '')).sum(),
    'year': df_enriched['year'].notna().sum(),
    'publisher': df_enriched['publisher'].notna().sum()
}
for field, count in before_stats.items():
    print(f"   {field}: {count:,} ({count/len(df_enriched)*100:.1f}%)")

# Identify records to process (those with any DNB variant)
def has_any_value(columns):
    """Row-wise mask: True where at least one of the given columns (if present) is non-null."""
    present = [c for c in columns if c in df_enriched.columns]
    if not present:
        # Return an all-False Series rather than a bare False so the mask can still index the frame
        return pd.Series(False, index=df_enriched.index)
    return df_enriched[present].notna().any(axis=1)

has_id = has_any_value(['dnb_title', 'dnb_authors', 'dnb_year', 'dnb_publisher'])
has_ta = has_any_value(['dnb_title_ta', 'dnb_authors_ta', 'dnb_year_ta', 'dnb_publisher_ta'])
records_to_process = df_enriched[has_id | has_ta].copy()

total_with_dnb = len(records_to_process)
print(f"\n🔄 Records with DNB variants: {total_with_dnb:,}")

# Filter already processed
records_to_process = records_to_process[~records_to_process.index.isin(already_fused)]

# Apply limit if in test mode
if FUSION_LIMIT and FUSION_LIMIT > 0:
    print(f"🧪 Test mode active: processing only the first {FUSION_LIMIT} records.")
    records_to_process = records_to_process.head(FUSION_LIMIT)

# Load and prioritize retry queue
retry_indices = []
if retry_queue_file.exists():
    try:
        with open(retry_queue_file, 'r', encoding='utf-8') as f:
            retry_indices = json.load(f)
    except (OSError, json.JSONDecodeError):
        # An unreadable or corrupt queue is treated as empty
        retry_indices = []

retry_indices = [i for i in retry_indices if i in records_to_process.index]
if len(retry_indices) > 0:
    print(f"🔁 Retry queue: {len(retry_indices):,} records will be processed first")
    retry_df = records_to_process.loc[records_to_process.index.isin(retry_indices)]
    fresh_df = records_to_process.loc[~records_to_process.index.isin(retry_indices)]
    records_to_process = pd.concat([retry_df, fresh_df], axis=0)

print(f"🔄 Remaining records: {len(records_to_process):,}")
print(f"   (Already fused: {len(already_fused):,})\n")

# Initialize statistics
fusion_stats = {
    'total_processed': len(already_fused),
    'conflicts_found': 0,
    'dnb_preferred': 0,
    'simple_merges': 0,
    'errors': 0,
    'dnb_matches_rejected': 0,
    'ai_decisions': 0,
    'variant_id': 0,
    'variant_title_author': 0,
    'variant_none': 0
}

fusion_count = 0
aborted = False

# Main fusion loop
for idx, row in tqdm(records_to_process.iterrows(), total=len(records_to_process), desc="🔄 Fusion", unit="records"):
    try:
        # Perform fusion
        result = fusion_engine.merge_record(row)
        result_dict = result.to_dict()
        
        # Update statistics
        variant = result_dict.get('dnb_variant_selected')
        if variant == 'id':
            fusion_stats['variant_id'] += 1
        elif variant == 'title_author':
            fusion_stats['variant_title_author'] += 1
        else:
            fusion_stats['variant_none'] += 1
        
        # Store results in DataFrame
        df_enriched.loc[idx, 'title'] = result_dict.get('title')
        df_enriched.loc[idx, 'authors_str'] = result_dict.get('authors')
        
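        # Convert year to numeric (unparseable values become NaN via errors='coerce')
        year_val = result_dict.get('year')
        if pd.notna(year_val):
            try:
                df_enriched.loc[idx, 'year'] = pd.to_numeric(year_val, errors='coerce')
            except (TypeError, ValueError):
                # Fall back to the raw value if it cannot be passed to to_numeric at all
                df_enriched.loc[idx, 'year'] = year_val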
        
        df_enriched.loc[idx, 'publisher'] = result_dict.get('publisher')
        df_enriched.loc[idx, 'fusion_title_source'] = result_dict.get('title_source')
        df_enriched.loc[idx, 'fusion_authors_source'] = result_dict.get('authors_source')
        df_enriched.loc[idx, 'fusion_year_source'] = result_dict.get('year_source')
        df_enriched.loc[idx, 'fusion_publisher_source'] = result_dict.get('publisher_source')
        df_enriched.loc[idx, 'fusion_conflicts'] = result_dict.get('conflicts')
        df_enriched.loc[idx, 'fusion_confirmations'] = result_dict.get('confirmations')
        df_enriched.loc[idx, 'fusion_ai_reasoning'] = result_dict.get('ai_reasoning')
        df_enriched.loc[idx, 'fusion_dnb_match_rejected'] = result_dict.get('dnb_match_rejected', False)
        df_enriched.loc[idx, 'fusion_rejection_reason'] = result_dict.get('rejection_reason')
        df_enriched.loc[idx, 'fusion_dnb_variant_selected'] = result_dict.get('dnb_variant_selected')
        
        # Clear retry flag if set
        if 'fusion_needs_retry' in df_enriched.columns:
            df_enriched.loc[idx, 'fusion_needs_retry'] = False
        if idx in retry_indices:
            retry_indices = [i for i in retry_indices if i != idx]
        
        # Update statistics
        fusion_stats['total_processed'] += 1
        fusion_count += 1
        fusion_stats['ai_decisions'] += 1
        
        if result_dict.get('dnb_match_rejected'):
            fusion_stats['dnb_matches_rejected'] += 1
        elif result_dict.get('conflicts'):
            fusion_stats['conflicts_found'] += 1
            fusion_stats['dnb_preferred'] += 1
        else:
            fusion_stats['simple_merges'] += 1
        
        # Incremental save
        if fusion_count % SAVE_INTERVAL == 0:
            df_enriched.to_parquet(progress_file, index=True)
            with open(retry_queue_file, 'w', encoding='utf-8') as f:
                json.dump(retry_indices, f, ensure_ascii=False, indent=2)
            print(f"\n💾 Checkpoint: {fusion_stats['total_processed']:,} records fused")

    except OllamaUnavailableError as e:
        print(f"\n❌ Ollama not reachable: {e}")
        print("👉 Record is placed in the retry queue")
        fusion_stats['errors'] += 1
        aborted = True

        df_enriched.loc[idx, 'fusion_needs_retry'] = True
        if idx not in retry_indices:
            retry_indices.append(idx)

        # Save immediately
        df_enriched.to_parquet(progress_file, index=True)
        with open(retry_queue_file, 'w', encoding='utf-8') as f:
            json.dump(retry_indices, f, ensure_ascii=False, indent=2)
        break

    except Exception as e:
        print(f"\n⚠️ Error in record {idx}: {e}")
        fusion_stats['errors'] += 1

# Final save
if fusion_count % SAVE_INTERVAL != 0 or fusion_count == 0:
    df_enriched.to_parquet(progress_file, index=True)
    with open(retry_queue_file, 'w', encoding='utf-8') as f:
        json.dump(retry_indices, f, ensure_ascii=False, indent=2)
    print("\n💾 Final state saved")

# Report completion only if the loop was not aborted
if aborted:
    print("\n⛔️ Run aborted (Ollama timeout)")
else:
    print("\n✅ Fusion complete")

In [None]:
# 🌍 LANGUAGE FUSION (dual-source strategy)
print("\n🌍 === LANGUAGE FUSION ===\n")

def merge_language(row):
    """
    Merge MARC21 language and langdetect results.
    
    Priority:
    1. MARC21 language (from catalog metadata - most reliable)
    2. langdetect detected_language (from title analysis)
    3. None if neither available
    
    Returns:
        tuple: (language_final, language_source, language_confidence)
    """
    marc21_lang = row.get('language')
    detected_lang = row.get('detected_language')
    detected_conf = row.get('detected_language_confidence', 0.0)
    
    # MARC21 has priority
    if pd.notna(marc21_lang) and str(marc21_lang).strip() not in ['', 'unknown']:
        return str(marc21_lang).strip(), 'marc21', 1.0
    
    # Fallback to langdetect
    elif pd.notna(detected_lang) and str(detected_lang).strip() not in ['', 'unknown']:
        return str(detected_lang).strip(), 'langdetect', float(detected_conf) if pd.notna(detected_conf) else 0.0
    
    # No language information
    else:
        return None, None, 0.0

# Apply language fusion
if 'language' in df_enriched.columns or 'detected_language' in df_enriched.columns:
    print("📊 Applying dual-source language fusion...")

    # Create new columns for merged language
    df_enriched[['language_final', 'language_source', 'language_confidence']] = df_enriched.apply(
        merge_language, axis=1, result_type='expand'
    )

    # Statistics
    marc21_count = df_enriched[df_enriched['language_source'] == 'marc21'].shape[0]
    langdetect_count = df_enriched[df_enriched['language_source'] == 'langdetect'].shape[0]
    total_with_lang = df_enriched['language_final'].notna().sum()

    print("\n📊 Language Fusion Results:")
    print(f"   Total with language: {total_with_lang:,} ({total_with_lang/len(df_enriched)*100:.1f}%)")
    print(f"   From MARC21: {marc21_count:,} ({marc21_count/len(df_enriched)*100:.1f}%)")
    print(f"   From langdetect: {langdetect_count:,} ({langdetect_count/len(df_enriched)*100:.1f}%)")
    print(f"   No language: {len(df_enriched) - total_with_lang:,} ({(len(df_enriched) - total_with_lang)/len(df_enriched)*100:.1f}%)")

    # Language distribution
    print("\n🌍 Top 10 Languages (final):")
    lang_dist = df_enriched['language_final'].value_counts().head(10)
    for lang, count in lang_dist.items():
        if pd.notna(lang):
            pct = count/total_with_lang*100 if total_with_lang > 0 else 0
            print(f"   {str(lang):10}: {count:6,} ({pct:5.1f}%)")

    print("\n✅ Language fusion complete")
else:
    print("⚠️ No language columns found - skipping language fusion")

In [None]:
# 📊 FUSION STATISTICS
print("📊 === FUSION RESULTS ===\n")

# Statistics AFTER fusion
print("📊 Completeness AFTER fusion:")
after_stats = {
    'title': df_enriched['title'].notna().sum(),
    'authors': (df_enriched['authors_str'].notna() & (df_enriched['authors_str'] != '')).sum(),
    'year': df_enriched['year'].notna().sum(),
    'publisher': df_enriched['publisher'].notna().sum()
}
for field, count in after_stats.items():
    improvement = count - before_stats[field]
    # The '+' format flag prints an explicit sign, so a decrease shows as '-n' instead of '+-n'
    print(f"   {field}: {count:,} ({count/len(df_enriched)*100:.1f}%) [{improvement:+,}]")

# Fusion statistics
print("\n📊 Fusion statistics:")
print(f"   Processed: {fusion_stats['total_processed']:,}")
print(f"   Simple merges: {fusion_stats['simple_merges']:,}")
print(f"   DNB chosen: {fusion_stats['dnb_preferred']:,}")
print(f"   Conflicts: {fusion_stats['conflicts_found']:,}")
print(f"   🚫 DNB rejected: {fusion_stats['dnb_matches_rejected']:,}")
print(f"   AI decisions: {fusion_stats['ai_decisions']:,}")
print(f"   Variant ID: {fusion_stats['variant_id']:,}")
print(f"   Variant title/author: {fusion_stats['variant_title_author']:,}")

# Source distribution
print("\n📊 Data sources:")
for field in ['title', 'authors', 'year', 'publisher']:
    source_col = f'fusion_{field}_source'
    if source_col in df_enriched.columns:
        sources = df_enriched[source_col].value_counts()
        print(f"\n   {field.upper()}:")
        for source, count in sources.items():
            if source:
                print(f"     {source}: {count:,}")

In [None]:
# 📝 GAP FILLING: fill missing fields from DNB data
print("\n📝 === GAP FILLING ===\n")

print("Filling missing metadata from DNB data...\n")

# Statistics BEFORE gap filling
before_gap_filling = {
    'isbn': df_enriched['isbn'].notna().sum(),
    'issn': df_enriched['issn'].notna().sum() if 'issn' in df_enriched.columns else 0,
}

filled_count = {
    'isbn': 0,
    'issn': 0,
    'authors': 0,
    'year': 0,
    'publisher': 0
}

# 1. ISBN Gap Filling
# Priority: dnb_isbn_ta (from title/author search - finds new ISBNs) > dnb_isbn (from ISBN search - only duplicates)
if 'dnb_isbn_ta' in df_enriched.columns:
    # Records with no ISBN but DNB has one
    no_isbn = df_enriched['isbn'].isna()
    has_dnb_isbn_ta = df_enriched['dnb_isbn_ta'].notna()
    
    fill_isbn_mask = no_isbn & has_dnb_isbn_ta
    filled_count['isbn'] = fill_isbn_mask.sum()
    
    if filled_count['isbn'] > 0:
        df_enriched.loc[fill_isbn_mask, 'isbn'] = df_enriched.loc[fill_isbn_mask, 'dnb_isbn_ta']
        # Mark source
        if 'isbn_source' not in df_enriched.columns:
            df_enriched['isbn_source'] = None
        df_enriched.loc[fill_isbn_mask, 'isbn_source'] = 'dnb_title_author'
        
        print(f"   ISBN: {filled_count['isbn']:,} newly filled from dnb_isbn_ta")

# 2. ISSN Gap Filling
if 'issn' in df_enriched.columns and 'dnb_issn_ta' in df_enriched.columns:
    no_issn = df_enriched['issn'].isna()
    has_dnb_issn_ta = df_enriched['dnb_issn_ta'].notna()
    
    fill_issn_mask = no_issn & has_dnb_issn_ta
    filled_count['issn'] = fill_issn_mask.sum()
    
    if filled_count['issn'] > 0:
        df_enriched.loc[fill_issn_mask, 'issn'] = df_enriched.loc[fill_issn_mask, 'dnb_issn_ta']
        # Mark source
        if 'issn_source' not in df_enriched.columns:
            df_enriched['issn_source'] = None
        df_enriched.loc[fill_issn_mask, 'issn_source'] = 'dnb_title_author'
        
        print(f"   ISSN: {filled_count['issn']:,} newly filled from dnb_issn_ta")

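# 3. Authors Gap Filling (from DNB where fusion didn't already fill)
# This fills authors that were NOT handled by fusion (e.g., records without fusion)
# Make sure the provenance columns exist even if the fusion loop processed no records
for _col in ('fusion_authors_source', 'fusion_year_source', 'fusion_publisher_source'):
    if _col not in df_enriched.columns:
        df_enriched[_col] = None
no_authors = (df_enriched['authors_str'].isna() | (df_enriched['authors_str'] == ''))
not_fused = df_enriched['fusion_authors_source'].isna()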

# Try dnb_authors_ta first
if 'dnb_authors_ta' in df_enriched.columns:
    has_dnb_authors_ta = (df_enriched['dnb_authors_ta'].notna() & (df_enriched['dnb_authors_ta'] != ''))
    fill_authors_mask = no_authors & not_fused & has_dnb_authors_ta
    
    if fill_authors_mask.sum() > 0:
        df_enriched.loc[fill_authors_mask, 'authors_str'] = df_enriched.loc[fill_authors_mask, 'dnb_authors_ta']
        df_enriched.loc[fill_authors_mask, 'fusion_authors_source'] = 'dnb_title_author_gap_fill'
        filled_count['authors'] += fill_authors_mask.sum()

# Then try dnb_authors
if 'dnb_authors' in df_enriched.columns:
    no_authors = (df_enriched['authors_str'].isna() | (df_enriched['authors_str'] == ''))
    not_fused = df_enriched['fusion_authors_source'].isna()
    has_dnb_authors = (df_enriched['dnb_authors'].notna() & (df_enriched['dnb_authors'] != ''))
    fill_authors_mask = no_authors & not_fused & has_dnb_authors
    
    if fill_authors_mask.sum() > 0:
        df_enriched.loc[fill_authors_mask, 'authors_str'] = df_enriched.loc[fill_authors_mask, 'dnb_authors']
        df_enriched.loc[fill_authors_mask, 'fusion_authors_source'] = 'dnb_id_gap_fill'
        filled_count['authors'] += fill_authors_mask.sum()

if filled_count['authors'] > 0:
    print(f"   Authors: {filled_count['authors']:,} newly filled from DNB")

# 4. Year Gap Filling
no_year = df_enriched['year'].isna()
not_fused = df_enriched['fusion_year_source'].isna()

# Try dnb_year_ta first
if 'dnb_year_ta' in df_enriched.columns:
    has_dnb_year_ta = df_enriched['dnb_year_ta'].notna()
    fill_year_mask = no_year & not_fused & has_dnb_year_ta
    
    if fill_year_mask.sum() > 0:
        df_enriched.loc[fill_year_mask, 'year'] = df_enriched.loc[fill_year_mask, 'dnb_year_ta']
        df_enriched.loc[fill_year_mask, 'fusion_year_source'] = 'dnb_title_author_gap_fill'
        filled_count['year'] += fill_year_mask.sum()

# Then try dnb_year
if 'dnb_year' in df_enriched.columns:
    no_year = df_enriched['year'].isna()
    not_fused = df_enriched['fusion_year_source'].isna()
    has_dnb_year = df_enriched['dnb_year'].notna()
    fill_year_mask = no_year & not_fused & has_dnb_year
    
    if fill_year_mask.sum() > 0:
        df_enriched.loc[fill_year_mask, 'year'] = df_enriched.loc[fill_year_mask, 'dnb_year']
        df_enriched.loc[fill_year_mask, 'fusion_year_source'] = 'dnb_id_gap_fill'
        filled_count['year'] += fill_year_mask.sum()

if filled_count['year'] > 0:
    print(f"   Year: {filled_count['year']:,} newly filled from DNB")

# 5. Publisher Gap Filling
no_publisher = df_enriched['publisher'].isna()
not_fused = df_enriched['fusion_publisher_source'].isna()

# Try dnb_publisher_ta first
if 'dnb_publisher_ta' in df_enriched.columns:
    has_dnb_pub_ta = df_enriched['dnb_publisher_ta'].notna()
    fill_pub_mask = no_publisher & not_fused & has_dnb_pub_ta
    
    if fill_pub_mask.sum() > 0:
        df_enriched.loc[fill_pub_mask, 'publisher'] = df_enriched.loc[fill_pub_mask, 'dnb_publisher_ta']
        df_enriched.loc[fill_pub_mask, 'fusion_publisher_source'] = 'dnb_title_author_gap_fill'
        filled_count['publisher'] += fill_pub_mask.sum()

# Then try dnb_publisher
if 'dnb_publisher' in df_enriched.columns:
    no_publisher = df_enriched['publisher'].isna()
    not_fused = df_enriched['fusion_publisher_source'].isna()
    has_dnb_pub = df_enriched['dnb_publisher'].notna()
    fill_pub_mask = no_publisher & not_fused & has_dnb_pub
    
    if fill_pub_mask.sum() > 0:
        df_enriched.loc[fill_pub_mask, 'publisher'] = df_enriched.loc[fill_pub_mask, 'dnb_publisher']
        df_enriched.loc[fill_pub_mask, 'fusion_publisher_source'] = 'dnb_id_gap_fill'
        filled_count['publisher'] += fill_pub_mask.sum()

if filled_count['publisher'] > 0:
    print(f"   Publisher: {filled_count['publisher']:,} newly filled from DNB")

# Statistics AFTER gap filling
after_gap_filling = {
    'isbn': df_enriched['isbn'].notna().sum(),
    'issn': df_enriched['issn'].notna().sum() if 'issn' in df_enriched.columns else 0,
}

print("\n📊 Gap filling summary:")
total_filled = sum(filled_count.values())
print(f"   Total newly filled: {total_filled:,} fields")
print(f"   ISBN: {before_gap_filling['isbn']:,} → {after_gap_filling['isbn']:,} (+{filled_count['isbn']:,})")
print(f"   ISSN: {before_gap_filling['issn']:,} → {after_gap_filling['issn']:,} (+{filled_count['issn']:,})")

print("\n✅ Gap filling complete")


In [None]:
# 💾 SAVE FINAL OUTPUT
import numpy as np

output_path = processed_dir / '05_fused_data.parquet'
output_metadata_path = processed_dir / '05_metadata.json'

# Save fused data
df_enriched.to_parquet(output_path, index=True)
print(f"💾 Fused data saved: {output_path}")
print(f"   Size: {output_path.stat().st_size / 1024**2:.1f} MB")

# Helper function to convert numpy/pandas types to native Python types
def convert_to_native(obj):
    """Recursively convert numpy/pandas types to native Python types for JSON serialization."""
    if isinstance(obj, dict):
        return {k: convert_to_native(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_native(item) for item in obj]
    elif isinstance(obj, (np.integer, np.int64)):
        return int(obj)
    elif isinstance(obj, (np.floating, np.float64)):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    elif pd.isna(obj):
        return None
    else:
        return obj

# Save metadata with type conversion
metadata = {
    'notebook': '05_vdeh_data_fusion',
    'timestamp': pd.Timestamp.now().isoformat(),
    'input_file': str(input_path),
    'output_file': str(output_path),
    'total_records': int(len(df_enriched)),
    'fusion_statistics': convert_to_native(fusion_stats),
    'completeness_before': convert_to_native(before_stats),
    'completeness_after': convert_to_native(after_stats),
    'previous_metadata': prev_metadata
}

# Write as UTF-8 explicitly, since ensure_ascii=False emits non-ASCII characters unescaped
with open(output_metadata_path, 'w', encoding='utf-8') as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

print(f"📋 Metadata saved: {output_metadata_path}")
print("\n✅ Pipeline stage 05 complete!")