# VDEH Collection: Holdings Comparison with UB TU Freiberg

**Goal:** Check which books from the VDEH collection are held by UB TU Freiberg (the university library).

## 🎯 Task
- **Before/after comparison**: evaluate matching with the original VDEH data vs. the fused VDEH data
- Load the UB MAB2 catalog data
- Matching strategies: ISBN, title + author
- Identify matches and non-matches
- Compare matching success before and after fusion

## 📚 Data sources
- **VDEH Original**: `data/vdeh/processed/03_language_detected_data.parquet` (before DNB/LoC enrichment)
- **VDEH Fused**: `data/vdeh/processed/06_vdeh_dnb_loc_fused_data.parquet` (after DNB/LoC enrichment)
- **UB MAB2**: `data/ub_tubaf/processed/01_loaded_data.parquet` (518,946 records)

## 🔍 Matching strategies
1. **ISBN match**: exact ISBN equality (highest precision)
2. **Title + author match**: fuzzy matching on normalized strings

In [1]:
# 🛠️ SETUP
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from datetime import datetime
import json
from rapidfuzz import fuzz
import re

# Add src to path
project_root = Path.cwd()
while not (project_root / 'config.yaml').exists() and project_root.parent != project_root:
    project_root = project_root.parent
sys.path.insert(0, str(project_root / 'src'))

from utils.notebook_utils import setup_notebook

project_root, config = setup_notebook()
print(f"✅ Project root: {project_root}")
print(f"✅ Project: {config.get('project.name')} v{config.get('project.version')}")

2026-01-02 15:40:44 - utils.notebook_utils - INFO - Searching for project root...
2026-01-02 15:40:44 - utils.notebook_utils - INFO - Project root found: /media/sz/Data/Bibo/analysis
2026-01-02 15:40:44 - utils.notebook_utils - INFO - Loading configuration...
2026-01-02 15:40:44 - config_loader - INFO - Configuration loaded from /media/sz/Data/Bibo/analysis/config.yaml
2026-01-02 15:40:44 - utils.notebook_utils - INFO - Configuration loaded successfully: Dual-Source Bibliothek Bestandsvergleich


✅ Project root: /media/sz/Data/Bibo/analysis
✅ Project: Dual-Source Bibliothek Bestandsvergleich v2.2.0


In [2]:
# 📂 LOAD DATA
print("📂 Loading data...\n")

vdeh_processed_dir = config.project_root / config.get('paths.data.vdeh.processed')

# Load ORIGINAL VDEH data (BEFORE fusion)
vdeh_original_path = vdeh_processed_dir / '03_language_detected_data.parquet'
if not vdeh_original_path.exists():
    raise FileNotFoundError(f"VDEH original data not found: {vdeh_original_path}")

df_vdeh_original = pd.read_parquet(vdeh_original_path)
print(f"📚 VDEH (ORIGINAL - before fusion): {len(df_vdeh_original):,} records")
print(f"   Columns: {list(df_vdeh_original.columns)[:10]}...")

# Load FUSED VDEH data (AFTER DNB + LoC enrichment)
vdeh_fused_path = vdeh_processed_dir / '06_vdeh_dnb_loc_fused_data.parquet'
if not vdeh_fused_path.exists():
    raise FileNotFoundError(f"VDEH fused data not found: {vdeh_fused_path}")

df_vdeh_fused = pd.read_parquet(vdeh_fused_path)
print(f"📚 VDEH (FUSED - after DNB+LoC): {len(df_vdeh_fused):,} records")
print(f"   Columns: {list(df_vdeh_fused.columns)[:10]}...")

# Load UB catalog data
ub_processed_dir = config.project_root / 'data/ub_tubaf/processed'
ub_path = ub_processed_dir / '01_loaded_data.parquet'

if not ub_path.exists():
    raise FileNotFoundError(f"UB catalog data not found: {ub_path}")

df_ub = pd.read_parquet(ub_path)
print(f"📚 UB TU Freiberg:                   {len(df_ub):,} records")
print(f"   Columns: {list(df_ub.columns)}")

print("\n✅ Data loaded")

📂 Loading data...

📚 VDEH (ORIGINAL - before fusion): 58,305 records
   Columns: ['id', 'title', 'authors', 'authors_affiliation', 'year', 'publisher', 'isbn', 'issn', 'pages', 'language']...
📚 VDEH (FUSED - after DNB+LoC): 58,305 records
   Columns: ['title', 'authors', 'year', 'publisher', 'pages', 'isbn', 'issn', 'title_source', 'authors_source', 'year_source']...
📚 UB TU Freiberg:                   518,946 records
   Columns: ['id', 'source', 'title', 'authors', 'authors_str', 'year', 'isbn', 'place', 'physical_desc']

✅ Data loaded


In [3]:
# 📊 DATA OVERVIEW - COMPARISON
print("📊 Data overview - BEFORE/AFTER comparison:\n")
print("="*80)

print("\n🔵 VDEH ORIGINAL (before fusion):")
print(f"   Total:       {len(df_vdeh_original):,}")
print(f"   With title:  {df_vdeh_original['title'].notna().sum():,} ({df_vdeh_original['title'].notna().sum()/len(df_vdeh_original)*100:.1f}%)")
print(f"   With author: {df_vdeh_original['authors'].notna().sum():,} ({df_vdeh_original['authors'].notna().sum()/len(df_vdeh_original)*100:.1f}%)")
print(f"   With year:   {df_vdeh_original['year'].notna().sum():,} ({df_vdeh_original['year'].notna().sum()/len(df_vdeh_original)*100:.1f}%)")

# Check for ISBN columns in the original VDEH data
vdeh_orig_isbn_cols = [col for col in df_vdeh_original.columns if 'isbn' in col.lower()]
print(f"   ISBN columns: {vdeh_orig_isbn_cols}")
if vdeh_orig_isbn_cols:
    # Note: this sums non-null cells across ALL ISBN-related columns (e.g. isbn_valid, isbn_status),
    # so it can exceed the number of records that actually carry an ISBN
    isbn_count_orig = sum(df_vdeh_original[col].notna().sum() for col in vdeh_orig_isbn_cols if col in df_vdeh_original.columns)
    print(f"   With ISBN:   {isbn_count_orig:,}")

print("\n🟢 VDEH FUSED (after DNB+LoC):")
print(f"   Total:       {len(df_vdeh_fused):,}")
print(f"   With title:  {df_vdeh_fused['title'].notna().sum():,} ({df_vdeh_fused['title'].notna().sum()/len(df_vdeh_fused)*100:.1f}%)")
print(f"   With author: {df_vdeh_fused['authors'].notna().sum():,} ({df_vdeh_fused['authors'].notna().sum()/len(df_vdeh_fused)*100:.1f}%)")
print(f"   With year:   {df_vdeh_fused['year'].notna().sum():,} ({df_vdeh_fused['year'].notna().sum()/len(df_vdeh_fused)*100:.1f}%)")

# Check for ISBN columns in the fused VDEH data; values may come from different sources
vdeh_fused_isbn_cols = [col for col in df_vdeh_fused.columns if 'isbn' in col.lower()]
print(f"   ISBN columns: {vdeh_fused_isbn_cols}")

print("\n📖 UB TU Freiberg:")
print(f"   Total:       {len(df_ub):,}")
print(f"   With title:  {df_ub['title'].notna().sum():,} ({df_ub['title'].notna().sum()/len(df_ub)*100:.1f}%)")
print(f"   With author: {df_ub['authors_str'].notna().sum():,} ({df_ub['authors_str'].notna().sum()/len(df_ub)*100:.1f}%)")
print(f"   With year:   {df_ub['year'].notna().sum():,} ({df_ub['year'].notna().sum()/len(df_ub)*100:.1f}%)")
print(f"   With ISBN:   {df_ub['isbn'].notna().sum():,} ({df_ub['isbn'].notna().sum()/len(df_ub)*100:.1f}%)")
print("\n" + "="*80)

📊 Data overview - BEFORE/AFTER comparison:


🔵 VDEH ORIGINAL (before fusion):
   Total:       58,305
   With title:  58,242 (99.9%)
   With author: 58,305 (100.0%)
   With year:   33,313 (57.1%)
   ISBN columns: ['isbn', 'isbn_valid', 'isbn_status']
   With ISBN:   31,521

🟢 VDEH FUSED (after DNB+LoC):
   Total:       58,305
   With title:  58,249 (99.9%)
   With author: 58,305 (100.0%)
   With year:   34,694 (59.5%)
   ISBN columns: ['isbn', 'isbn_source']

📖 UB TU Freiberg:
   Total:       518,946
   With title:  497,797 (95.9%)
   With author: 451,331 (87.0%)
   With year:   496,993 (95.8%)
   With ISBN:   289,292 (55.7%)



In [4]:
# 🔧 HELPER FUNCTIONS
print("🔧 Defining helper functions...\n")

def normalize_isbn(isbn):
    """Normalize an ISBN: strip hyphens and whitespace, return it as a string."""
    if pd.isna(isbn) or isbn is None:
        return None
    isbn_str = str(isbn).strip()
    # Remove everything except digits and X (the ISBN-10 check character)
    isbn_clean = re.sub(r'[^0-9X]', '', isbn_str.upper())
    if len(isbn_clean) == 0:
        return None
    return isbn_clean

def isbn_10_to_13(isbn10):
    """Convert an ISBN-10 to an ISBN-13."""
    # Guard against stray 'X' characters in the first nine positions, which int() cannot parse
    if len(isbn10) != 10 or not isbn10[:9].isdigit():
        return None
    base = '978' + isbn10[:9]
    check_sum = 0
    for i, digit in enumerate(base):
        if i % 2 == 0:
            check_sum += int(digit)
        else:
            check_sum += int(digit) * 3
    check_digit = (10 - (check_sum % 10)) % 10
    return base + str(check_digit)

def isbn_13_to_10(isbn13):
    """Convert an ISBN-13 to an ISBN-10 (only possible for the 978 prefix)."""
    if len(isbn13) != 13 or not isbn13.isdigit() or not isbn13.startswith('978'):
        return None
    base = isbn13[3:12]
    check_sum = 0
    for i in range(9):
        check_sum += int(base[i]) * (10 - i)
    check_digit = (11 - (check_sum % 11)) % 11
    return base + ('X' if check_digit == 10 else str(check_digit))

def normalize_text(text):
    """Normalize text for fuzzy matching: lowercase, strip special characters."""
    # Handle None
    if text is None:
        return ""
    # Handle NaN (check type first to avoid ValueError)
    if isinstance(text, float) and pd.isna(text):
        return ""
    # Handle lists/arrays (e.g. authors stored as a list)
    if isinstance(text, (list, np.ndarray)):
        if len(text) == 0:
            return ""
        # Join list elements
        text = '; '.join(str(item) for item in text if item is not None and (not isinstance(item, float) or not pd.isna(item)))
        if not text:
            return ""
    text_str = str(text).lower()
    # Replace special characters; keep only ASCII letters, digits, and whitespace
    text_clean = re.sub(r'[^a-z0-9\s]', ' ', text_str)
    # Collapse runs of whitespace into a single space
    text_clean = re.sub(r'\s+', ' ', text_clean).strip()
    return text_clean

def get_isbn_from_vdeh(row, isbn_cols):
    """Return the first available ISBN from a VDEH data row."""
    for col in isbn_cols:
        if col in row.index and pd.notna(row[col]):
            normalized = normalize_isbn(row[col])
            if normalized:
                return normalized
    return None

def perform_matching(df_vdeh, vdeh_name, df_ub, ub_isbn_index, ub_for_fuzzy):
    """Run ISBN and fuzzy matching and return matches, non-matches, and statistics."""

    print(f"\n{'='*80}")
    print(f"🔍 MATCHING: {vdeh_name}")
    print(f"{'='*80}\n")

    # Normalize VDEH fields
    vdeh_isbn_cols = [col for col in df_vdeh.columns if 'isbn' in col.lower()]
    df_vdeh_work = df_vdeh.copy()
    df_vdeh_work['isbn_normalized'] = df_vdeh_work.apply(lambda row: get_isbn_from_vdeh(row, vdeh_isbn_cols), axis=1)
    df_vdeh_work['title_normalized'] = df_vdeh_work['title'].apply(normalize_text)
    df_vdeh_work['authors_normalized'] = df_vdeh_work['authors'].apply(normalize_text)

    print("Normalization:")
    print(f"   ISBNs found: {df_vdeh_work['isbn_normalized'].notna().sum():,} ({df_vdeh_work['isbn_normalized'].notna().sum()/len(df_vdeh_work)*100:.1f}%)")
    print(f"   Normalized titles: {df_vdeh_work['title_normalized'].str.len().gt(0).sum():,}")
    print(f"   Normalized authors: {df_vdeh_work['authors_normalized'].str.len().gt(0).sum():,}")

    # === STRATEGY 1: ISBN MATCHING ===
    print(f"\n🔍 STRATEGY 1: ISBN MATCHING (with ISBN-10/13 conversion)")
    vdeh_with_isbn = df_vdeh_work[df_vdeh_work['isbn_normalized'].notna()].copy()
    print(f"   VDEH records with ISBN: {len(vdeh_with_isbn):,}")
    
    isbn_matches = []
    for idx, row in vdeh_with_isbn.iterrows():
        isbn = row['isbn_normalized']
        if isbn in ub_isbn_index.index:
            ub_matches = ub_isbn_index.loc[isbn]
            if isinstance(ub_matches, pd.DataFrame):
                ub_match = ub_matches.iloc[0]
                match_count = len(ub_matches)
            else:
                ub_match = ub_matches
                match_count = 1
            
            isbn_matches.append({
                'vdeh_index': idx,
                'vdeh_title': row['title'],
                'vdeh_authors': row['authors'],
                'vdeh_year': row['year'],
                'isbn': isbn,
                'ub_id': ub_match['id'],
                'ub_title': ub_match['title'],
                'ub_authors': ub_match['authors_str'],
                'ub_year': ub_match['year'],
                'match_method': 'ISBN',
                'ub_match_count': match_count
            })
    
    df_isbn_matches = pd.DataFrame(isbn_matches)
    print(f"   ✅ ISBN matches: {len(df_isbn_matches):,}")
    if len(vdeh_with_isbn) > 0:
        print(f"   Match rate (with ISBN): {len(df_isbn_matches)/len(vdeh_with_isbn)*100:.1f}%")
    print(f"   Match rate (overall): {len(df_isbn_matches)/len(df_vdeh_work)*100:.1f}%")

    # === STRATEGY 2: FUZZY MATCHING ===
    print(f"\n🔍 STRATEGY 2: TITLE+AUTHOR FUZZY MATCHING")
    matched_indices = set(df_isbn_matches['vdeh_index'].values) if len(df_isbn_matches) > 0 else set()
    vdeh_no_isbn_match = df_vdeh_work[~df_vdeh_work.index.isin(matched_indices)].copy()
    
    vdeh_for_fuzzy = vdeh_no_isbn_match[
        (vdeh_no_isbn_match['title_normalized'].str.len() > 0) &
        (vdeh_no_isbn_match['authors_normalized'].str.len() > 0)
    ].copy()
    
    print(f"   VDEH without ISBN match: {len(vdeh_no_isbn_match):,}")
    print(f"   With title+author: {len(vdeh_for_fuzzy):,}")

    # Fuzzy matching parameters
    TITLE_THRESHOLD = 85
    AUTHOR_THRESHOLD = 80
    MAX_CANDIDATES_PER_VDEH = 100
    
    fuzzy_matches = []
    start_time = datetime.now()
    
    for i, (vdeh_idx, vdeh_row) in enumerate(vdeh_for_fuzzy.iterrows()):
        if (i + 1) % 1000 == 0:
            elapsed = (datetime.now() - start_time).total_seconds()
            rate = (i + 1) / elapsed if elapsed > 0 else 0
            remaining = len(vdeh_for_fuzzy) - (i + 1)
            eta = remaining / rate if rate > 0 else 0
            print(f"\r   [{i+1:,}/{len(vdeh_for_fuzzy):,}] {(i+1)/len(vdeh_for_fuzzy)*100:.1f}% | "
                  f"{rate:.1f} rec/s | ETA: {eta/60:.1f} min | Matches: {len(fuzzy_matches):,}", end='', flush=True)
        
        vdeh_title = vdeh_row['title_normalized']
        vdeh_author = vdeh_row['authors_normalized']
        
        title_similarities = ub_for_fuzzy['title_normalized'].apply(
            lambda x: fuzz.ratio(vdeh_title, x)
        )
        
        title_candidates = title_similarities[title_similarities >= TITLE_THRESHOLD]
        
        if len(title_candidates) == 0:
            continue
        
        if len(title_candidates) > MAX_CANDIDATES_PER_VDEH:
            title_candidates = title_candidates.nlargest(MAX_CANDIDATES_PER_VDEH)
        
        for ub_idx in title_candidates.index:
            ub_row = ub_for_fuzzy.loc[ub_idx]
            ub_author = ub_row['authors_normalized']
            
            author_sim = fuzz.ratio(vdeh_author, ub_author)
            
            if author_sim >= AUTHOR_THRESHOLD:
                fuzzy_matches.append({
                    'vdeh_index': vdeh_idx,
                    'vdeh_title': vdeh_row['title'],
                    'vdeh_authors': vdeh_row['authors'],
                    'vdeh_year': vdeh_row['year'],
                    'ub_id': ub_row['id'],
                    'ub_title': ub_row['title'],
                    'ub_authors': ub_row['authors_str'],
                    'ub_year': ub_row['year'],
                    'match_method': 'Title+Author Fuzzy',
                    'title_similarity': title_similarities.loc[ub_idx],
                    'author_similarity': author_sim,
                    'combined_similarity': (title_similarities.loc[ub_idx] + author_sim) / 2
                })
                break
    
    df_fuzzy_matches = pd.DataFrame(fuzzy_matches)
    print(f"\n   ✅ Fuzzy matches: {len(df_fuzzy_matches):,}")
    if len(vdeh_for_fuzzy) > 0:
        print(f"   Match rate: {len(df_fuzzy_matches)/len(vdeh_for_fuzzy)*100:.1f}%")
    print(f"   Duration: {(datetime.now() - start_time).total_seconds()/60:.1f} minutes")
    
    # === COMBINE RESULTS ===
    all_matches = []
    if len(df_isbn_matches) > 0:
        all_matches.append(df_isbn_matches)
    if len(df_fuzzy_matches) > 0:
        all_matches.append(df_fuzzy_matches)
    
    if len(all_matches) > 0:
        df_all_matches = pd.concat(all_matches, ignore_index=True)
        df_all_matches = df_all_matches.drop_duplicates(subset=['vdeh_index'], keep='first')
    else:
        df_all_matches = pd.DataFrame()
    
    # Books not found
    if len(df_all_matches) > 0:
        matched_indices = set(df_all_matches['vdeh_index'].values)
    else:
        matched_indices = set()
    df_not_found = df_vdeh_work[~df_vdeh_work.index.isin(matched_indices)].copy()
    
    # Statistics
    stats = {
        'dataset_name': vdeh_name,
        'total_vdeh_books': len(df_vdeh_work),
        'total_ub_books': len(df_ub),
        'vdeh_with_isbn': int(df_vdeh_work['isbn_normalized'].notna().sum()),
        'isbn_matches': len(df_isbn_matches),
        'fuzzy_matches': len(df_fuzzy_matches),
        'total_matches': len(df_all_matches),
        'not_found': len(df_not_found),
        'match_rate': len(df_all_matches) / len(df_vdeh_work) * 100 if len(df_all_matches) > 0 else 0,
        'isbn_match_rate': len(df_isbn_matches) / df_vdeh_work['isbn_normalized'].notna().sum() * 100 if df_vdeh_work['isbn_normalized'].notna().sum() > 0 else 0,
        'timestamp': datetime.now().isoformat()
    }
    
    print(f"\n📊 SUMMARY:")
    print(f"   Total matches: {stats['total_matches']:,} ({stats['match_rate']:.1f}%)")
    print(f"   - ISBN matches: {stats['isbn_matches']:,}")
    print(f"   - Fuzzy matches: {stats['fuzzy_matches']:,}")
    print(f"   Not found: {stats['not_found']:,} ({100-stats['match_rate']:.1f}%)")

    return df_all_matches, df_not_found, stats

print("✅ Helper functions defined")

🔧 Defining helper functions...

✅ Helper functions defined
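
One property of `normalize_text` is worth illustrating: the pattern `[^a-z0-9\s]` turns every non-ASCII letter into a space, so a title that stores `ä` as a single code point and one that stores it as `a` plus a combining diaeresis normalize to different strings. The sketch below shows a Unicode-aware variant built only on the standard library. The name `normalize_text_unicode` is hypothetical and not part of the notebook, and folding accents to base letters is an assumption about what is desirable here, not the notebook's method.

```python
import re
import unicodedata

def normalize_text_unicode(text: str) -> str:
    """Lowercase, fold accented letters to base letters, collapse punctuation."""
    # NFKD splits precomposed letters such as 'ä' into 'a' + combining mark
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks so both spellings end up as plain 'a'
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same cleanup as the notebook: keep a-z, 0-9 and whitespace only
    cleaned = re.sub(r"[^a-z0-9\s]", " ", stripped)
    return re.sub(r"\s+", " ", cleaned).strip()

composed = "H\u00e4rtereitechnisches Fachwissen"       # 'ä' as one code point
decomposed = "Ha\u0308rtereitechnisches Fachwissen"    # 'a' + combining diaeresis
print(normalize_text_unicode(composed))
print(normalize_text_unicode(composed) == normalize_text_unicode(decomposed))
```

Letters without a decomposition, such as `ß`, are still replaced by a space in this sketch and would need an explicit mapping if they matter for matching.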


In [5]:
# 🔄 PREPARE UB DATA (once, shared by both matching runs)
print("🔄 Preparing UB data for matching...\n")

# UB: add normalized fields
df_ub['isbn_normalized'] = df_ub['isbn'].apply(normalize_isbn)
df_ub['title_normalized'] = df_ub['title'].apply(normalize_text)
df_ub['authors_normalized'] = df_ub['authors_str'].apply(normalize_text)

print("UB normalization:")
print(f"   ISBNs normalized: {df_ub['isbn_normalized'].notna().sum():,} ({df_ub['isbn_normalized'].notna().sum()/len(df_ub)*100:.1f}%)")
print(f"   Normalized titles: {df_ub['title_normalized'].str.len().gt(0).sum():,}")
print(f"   Normalized authors: {df_ub['authors_normalized'].str.len().gt(0).sum():,}")

# Build an extended ISBN index for UB (with ISBN-10/13 conversion)
print("\n🔄 Building extended ISBN index with ISBN-10/13 conversion...")
ub_isbn_map = {}

for idx, row in df_ub[df_ub['isbn_normalized'].notna()].iterrows():
    isbn = row['isbn_normalized']
    
    # Original ISBN
    ub_isbn_map[isbn] = row
    
    # Convert ISBN-10 <-> ISBN-13
    if len(isbn) == 10:
        isbn13 = isbn_10_to_13(isbn)
        if isbn13:
            ub_isbn_map[isbn13] = row
    elif len(isbn) == 13 and isbn.startswith('978'):
        isbn10 = isbn_13_to_10(isbn)
        if isbn10:
            ub_isbn_map[isbn10] = row

# Build a DataFrame from the map and use the ISBN keys as its index
ub_isbn_data = list(ub_isbn_map.values())
ub_isbn_keys = list(ub_isbn_map.keys())

if ub_isbn_data:
    ub_isbn_index = pd.DataFrame(ub_isbn_data)
    ub_isbn_index.index = pd.Index(ub_isbn_keys)
    
    print(f"✅ UB ISBN index built: {len(set(ub_isbn_keys)):,} unique ISBNs")
    # Approximation: subtracts the number of UB records that have an ISBN, not the number of distinct ISBNs
    print(f"   (of which {len(set(ub_isbn_keys)) - df_ub['isbn_normalized'].notna().sum():,} from ISBN-10/13 conversion)")
else:
    ub_isbn_index = pd.DataFrame()
    print(f"⚠️  No ISBNs found")

# UB candidates for fuzzy matching
ub_for_fuzzy = df_ub[
    (df_ub['title_normalized'].str.len() > 0) &
    (df_ub['authors_normalized'].str.len() > 0)
].copy()
print(f"✅ UB fuzzy candidates: {len(ub_for_fuzzy):,} entries")

🔄 Preparing UB data for matching...

UB normalization:
   ISBNs normalized: 289,292 (55.7%)
   Normalized titles: 497,797
   Normalized authors: 451,330

🔄 Building extended ISBN index with ISBN-10/13 conversion...
✅ UB ISBN index built: 519,124 unique ISBNs
   (of which 229,832 from ISBN-10/13 conversion)
✅ UB fuzzy candidates: 441,101 entries
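
Because `ub_isbn_map` is a plain dict keyed by ISBN, a later UB record with the same ISBN overwrites an earlier one, so the index never holds more than one row per key. A possible alternative, sketched here under assumptions, is to canonicalize every ISBN to its ISBN-13 form and keep a list of record ids per key. The names `to_isbn13` and `build_isbn_index` and the record ids are hypothetical; `9781878954428` is the ISBN-13 form of `1878954423`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

def to_isbn13(isbn: str) -> Optional[str]:
    """Return the ISBN-13 form of a cleaned ISBN-10/13, or None if malformed."""
    if len(isbn) == 13 and isbn.isdigit():
        return isbn
    if len(isbn) == 10 and isbn[:9].isdigit():
        base = "978" + isbn[:9]
        # ISBN-13 check digit: weights alternate 1, 3 over the first 12 digits
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(base))
        return base + str((10 - total % 10) % 10)
    return None

def build_isbn_index(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each canonical ISBN-13 to every record id that carries it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for record_id, isbn in records:
        key = to_isbn13(isbn)
        if key is not None:
            index[key].append(record_id)
    return index

# Hypothetical record ids for illustration only
ub_records = [
    ("ub-1", "1878954423"),
    ("ub-2", "9781878954428"),
    ("ub-3", "9783342004011"),
]
index = build_isbn_index(ub_records)
print(index.get(to_isbn13("1878954423"), []))  # all holdings sharing this ISBN
```

With a single canonical key on both sides, the lookup needs no ISBN-10/13 expansion of the index, and the number of holdings per ISBN stays available.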


In [6]:
# 🔵 MATCHING 1: ORIGINAL VDEH (before fusion)
df_matches_original, df_not_found_original, stats_original = perform_matching(
    df_vdeh_original,
    "VDEH ORIGINAL (before DNB/LoC)",
    df_ub,
    ub_isbn_index,
    ub_for_fuzzy
)


🔍 MATCHING: VDEH ORIGINAL (before DNB/LoC)

Normalization:
   ISBNs found: 10,507 (18.0%)
   Normalized titles: 58,238
   Normalized authors: 17,536

🔍 STRATEGY 1: ISBN MATCHING (with ISBN-10/13 conversion)
   VDEH records with ISBN: 10,507
   ✅ ISBN matches: 3,545
   Match rate (with ISBN): 33.7%
   Match rate (overall): 6.1%

🔍 STRATEGY 2: TITLE+AUTHOR FUZZY MATCHING
   VDEH without ISBN match: 54,760
   With title+author: 14,581
   [14,000/14,581] 96.0% | 3.9 rec/s | ETA: 2.5 min | Matches: 8556
   ✅ Fuzzy matches: 883
   Match rate: 6.1%
   Duration: 62.0 minutes

📊 SUMMARY:
   Total matches: 4,428 (7.6%)
   - ISBN matches: 3,545
   - Fuzzy matches: 883
   Not found: 53,877 (92.4%)


In [7]:
# 🟢 MATCHING 2: FUSED VDEH (after DNB+LoC fusion)
df_matches_fused, df_not_found_fused, stats_fused = perform_matching(
    df_vdeh_fused,
    "VDEH FUSED (after DNB/LoC)",
    df_ub,
    ub_isbn_index,
    ub_for_fuzzy
)


🔍 MATCHING: VDEH FUSED (after DNB/LoC)

Normalization:
   ISBNs found: 14,845 (25.5%)
   Normalized titles: 58,246
   Normalized authors: 22,464

🔍 STRATEGY 1: ISBN MATCHING (with ISBN-10/13 conversion)
   VDEH records with ISBN: 14,845
   ✅ ISBN matches: 5,547
   Match rate (with ISBN): 37.4%
   Match rate (overall): 9.5%

🔍 STRATEGY 2: TITLE+AUTHOR FUZZY MATCHING
   VDEH without ISBN match: 52,758
   With title+author: 17,709
   [17,000/17,709] 96.0% | 4.2 rec/s | ETA: 2.8 min | Matches: 1,1771
   ✅ Fuzzy matches: 1,226
   Match rate: 6.9%
   Duration: 70.3 minutes

📊 SUMMARY:
   Total matches: 6,773 (11.6%)
   - ISBN matches: 5,547
   - Fuzzy matches: 1,226
   Not found: 51,532 (88.4%)
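
Each fuzzy pass compares every VDEH title against all 441,101 UB candidates, which is consistent with the 62.0 and 70.3 minute run times. A common way to shrink the candidate set is blocking: only compare records that share a cheap key. The sketch below blocks on the first word of the normalized title and uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`; the two scorers do not return identical values. The blocking key, the function names, and the sample titles are assumptions for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

def block_key(title: str) -> str:
    """Cheap blocking key: first word of an already-normalized title."""
    parts = title.split()
    return parts[0] if parts else ""

def build_blocks(titles: List[str]) -> Dict[str, List[int]]:
    """Group title positions by blocking key."""
    blocks: Dict[str, List[int]] = defaultdict(list)
    for pos, title in enumerate(titles):
        blocks[block_key(title)].append(pos)
    return blocks

def best_match(query: str, titles: List[str], blocks: Dict[str, List[int]],
               threshold: float = 85.0) -> Optional[Tuple[int, float]]:
    """Return (position, score) of the best candidate in the query's block."""
    best: Optional[Tuple[int, float]] = None
    for pos in blocks.get(block_key(query), []):
        # Similarity on a 0-100 scale, only computed within the block
        score = SequenceMatcher(None, query, titles[pos]).ratio() * 100
        if score >= threshold and (best is None or score > best[1]):
            best = (pos, score)
    return best

# Made-up catalog titles, already normalized
ub_titles = ["powder metallurgy science", "powder technology handbook", "stahlfibel"]
blocks = build_blocks(ub_titles)
print(best_match("powder metallurgy sciences", ub_titles, blocks))
print(best_match("unknown title", ub_titles, blocks))
```

Blocking trades recall for speed: two titles whose first words differ are never compared, so the key should be chosen with the data's typical variations in mind.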


In [8]:
# 📊 BEFORE/AFTER COMPARISON
print("\n" + "="*80)
print("📊 BEFORE/AFTER COMPARISON")
print("="*80)

comparison = pd.DataFrame([
    {
        'Dataset': '🔵 ORIGINAL (before fusion)',
        'Total books': stats_original['total_vdeh_books'],
        'With ISBN': stats_original['vdeh_with_isbn'],
        'ISBN %': f"{stats_original['vdeh_with_isbn']/stats_original['total_vdeh_books']*100:.1f}%",
        'ISBN matches': stats_original['isbn_matches'],
        'Fuzzy matches': stats_original['fuzzy_matches'],
        'Total matches': stats_original['total_matches'],
        'Match rate': f"{stats_original['match_rate']:.1f}%",
        'Not found': stats_original['not_found']
    },
    {
        'Dataset': '🟢 FUSED (after DNB+LoC)',
        'Total books': stats_fused['total_vdeh_books'],
        'With ISBN': stats_fused['vdeh_with_isbn'],
        'ISBN %': f"{stats_fused['vdeh_with_isbn']/stats_fused['total_vdeh_books']*100:.1f}%",
        'ISBN matches': stats_fused['isbn_matches'],
        'Fuzzy matches': stats_fused['fuzzy_matches'],
        'Total matches': stats_fused['total_matches'],
        'Match rate': f"{stats_fused['match_rate']:.1f}%",
        'Not found': stats_fused['not_found']
    }
])

display(comparison)

# Compute improvements (the ':+,' format prints the sign, so negative deltas render correctly too)
print("\n📈 IMPROVEMENTS FROM FUSION:")
print("="*80)

isbn_improvement = stats_fused['vdeh_with_isbn'] - stats_original['vdeh_with_isbn']
isbn_improvement_pct = (stats_fused['vdeh_with_isbn'] / stats_original['vdeh_with_isbn'] - 1) * 100 if stats_original['vdeh_with_isbn'] > 0 else 0
print(f"ISBNs:          {isbn_improvement:+,} ({isbn_improvement_pct:+.1f}%)")

isbn_match_improvement = stats_fused['isbn_matches'] - stats_original['isbn_matches']
isbn_match_improvement_pct = (stats_fused['isbn_matches'] / stats_original['isbn_matches'] - 1) * 100 if stats_original['isbn_matches'] > 0 else 0
print(f"ISBN matches:   {isbn_match_improvement:+,} ({isbn_match_improvement_pct:+.1f}%)")

total_match_improvement = stats_fused['total_matches'] - stats_original['total_matches']
total_match_improvement_pct = (stats_fused['total_matches'] / stats_original['total_matches'] - 1) * 100 if stats_original['total_matches'] > 0 else 0
print(f"Total matches:  {total_match_improvement:+,} ({total_match_improvement_pct:+.1f}%)")

match_rate_improvement = stats_fused['match_rate'] - stats_original['match_rate']
print(f"Match rate:     {stats_original['match_rate']:.1f}% → {stats_fused['match_rate']:.1f}% ({match_rate_improvement:+.1f} percentage points)")

print("\n✅ Comparison complete")


📊 BEFORE/AFTER COMPARISON


| | Dataset | Total books | With ISBN | ISBN % | ISBN matches | Fuzzy matches | Total matches | Match rate | Not found |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 🔵 ORIGINAL (before fusion) | 58305 | 10507 | 18.0% | 3545 | 883 | 4428 | 7.6% | 53877 |
| 1 | 🟢 FUSED (after DNB+LoC) | 58305 | 14845 | 25.5% | 5547 | 1226 | 6773 | 11.6% | 51532 |



📈 IMPROVEMENTS FROM FUSION:
ISBNs:          +4,338 (+41.3%)
ISBN matches:   +2,002 (+56.5%)
Total matches:  +2,345 (+53.0%)
Match rate:     7.6% → 11.6% (+4.0 percentage points)

✅ Comparison complete


In [9]:
# 🔍 ANALYSIS OF THE ADDITIONAL MATCHES
print("\n" + "="*80)
print("🔍 ANALYSIS: Which additional books were found through fusion?")
print("="*80 + "\n")

# Identify new matches
if len(df_matches_original) > 0 and len(df_matches_fused) > 0:
    original_matched_indices = set(df_matches_original['vdeh_index'].values)
    fused_matched_indices = set(df_matches_fused['vdeh_index'].values)

    # New matches = found in the fused run but not in the original run
    new_match_indices = fused_matched_indices - original_matched_indices

    print(f"New matches from fusion: {len(new_match_indices):,}")

    if len(new_match_indices) > 0:
        df_new_matches = df_matches_fused[df_matches_fused['vdeh_index'].isin(new_match_indices)].copy()

        # Break down the new matches by match method
        print("\nMatch methods of the new matches:")
        method_dist = df_new_matches['match_method'].value_counts()
        for method, count in method_dist.items():
            print(f"   {method:25s}: {count:,} ({count/len(df_new_matches)*100:.1f}%)")

        # Show examples
        print("\n📋 Examples of new matches from fusion:")
        print("="*80)
        for i, (idx, row) in enumerate(df_new_matches.head(5).iterrows()):
            print(f"\nExample {i+1}:")
            print(f"  VDEH: {row['vdeh_title'][:70]}")
            print(f"        {row['vdeh_authors'][:70] if pd.notna(row['vdeh_authors']) else 'N/A'}")
            print(f"  UB:   {row['ub_title'][:70]}")
            print(f"  Method: {row['match_method']}")
            if 'isbn' in row and pd.notna(row['isbn']):
                print(f"  ISBN: {row['isbn']}")
            # After pd.concat this column exists on every row, so ISBN matches print nan here
            if 'title_similarity' in row:
                print(f"  Similarity: title={row['title_similarity']:.0f}%, author={row['author_similarity']:.0f}%")
else:
    print("⚠️  No new matches to analyze")


🔍 ANALYSIS: Which additional books were found through fusion?

New matches from fusion: 2,862

Match methods of the new matches:
   ISBN                     : 2,405 (84.0%)
   Title+Author Fuzzy       : 457 (16.0%)

📋 Examples of new matches from fusion:

Example 1:
  VDEH: Powder metallurgy science /
        German, Randall M.,
  UB:   Powder metallurgy science : Randall M. German
  Method: ISBN
  ISBN: 1878954423
  Similarity: title=nan%, author=nan%

Example 2:
  VDEH: Härtereitechnisches Fachwissen
        Mainka, Joachim
  UB:   Härtereitechnisches Fachwissen : Joachim Mainka
  Method: ISBN
  ISBN: 9783342004011
  Similarity: title=nan%, author=nan%

Example 3:
  VDEH: Grundlagen metallischer Werkstoffe, Korrosion und Korrosionsschutz
        
  UB:   Grundlagen metallischer Werkstoffe, Korrosion und Korrosionsschutz : v
  Method: ISBN
  ISBN: 9783342002741
  Similarity: title=nan%, author=nan%

Example 4:
  VDEH: Stahlfibel
        Eube, Joachim
  UB:   S
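
The net gain reported in the before/after comparison (+2,345 matches) is smaller than the 2,862 new matches counted in this cell. Since the net change equals gained minus lost, this implies that 2,862 − 2,345 = 517 records were matched in the original run but not in the fused run, assuming both runs refer to the same records by `vdeh_index`. The set difference in the cell only looks in one direction. A small sketch of checking both directions follows; the function name and the index values are made up for illustration.

```python
def compare_match_sets(original: set, fused: set) -> dict:
    """Split the change between two match sets into gained and lost members."""
    gained = fused - original   # matched only after fusion
    lost = original - fused     # matched only before fusion
    return {"gained": gained, "lost": lost, "net": len(gained) - len(lost)}

# Made-up index values
original_idx = {1, 2, 3, 4}
fused_idx = {3, 4, 5, 6, 7}
result = compare_match_sets(original_idx, fused_idx)
print(sorted(result["gained"]), sorted(result["lost"]), result["net"])

# The same identity applied to the totals reported in this notebook:
# net = gained - lost, so lost = gained - net
implied_lost = 2_862 - (6_773 - 4_428)
print(implied_lost)
```

Inspecting the lost set would show whether fusion replaced a field value that had previously produced a match.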

In [10]:
# 💾 SAVE RESULTS
print("\n" + "="*80)
print("💾 SAVING RESULTS")
print("="*80 + "\n")

comparison_dir = config.project_root / 'notebooks/03_comparison/results'
comparison_dir.mkdir(parents=True, exist_ok=True)

# 1. Original Matches
if len(df_matches_original) > 0:
    matches_orig_path = comparison_dir / 'vdeh_ub_matches_original.parquet'
    df_matches_original.to_parquet(matches_orig_path, index=False)
    print(f"✅ Original Matches: {matches_orig_path}")
    print(f"   Records: {len(df_matches_original):,}")
    
    matches_orig_csv = comparison_dir / 'vdeh_ub_matches_original.csv'
    df_matches_original.to_csv(matches_orig_csv, index=False, encoding='utf-8-sig')
    print(f"   CSV: {matches_orig_csv}")

# 2. Fused Matches
if len(df_matches_fused) > 0:
    matches_fused_path = comparison_dir / 'vdeh_ub_matches_fused.parquet'
    df_matches_fused.to_parquet(matches_fused_path, index=False)
    print(f"\n✅ Fused Matches: {matches_fused_path}")
    print(f"   Records: {len(df_matches_fused):,}")
    
    matches_fused_csv = comparison_dir / 'vdeh_ub_matches_fused.csv'
    df_matches_fused.to_csv(matches_fused_csv, index=False, encoding='utf-8-sig')
    print(f"   CSV: {matches_fused_csv}")

# 3. Not found - Original
if len(df_not_found_original) > 0:
    not_found_orig_path = comparison_dir / 'vdeh_not_in_ub_original.parquet'
    df_not_found_original.to_parquet(not_found_orig_path, index=True)
    print(f"\n✅ Not found (original): {not_found_orig_path}")
    print(f"   Records: {len(df_not_found_original):,}")

# 4. Not found - Fused
if len(df_not_found_fused) > 0:
    not_found_fused_path = comparison_dir / 'vdeh_not_in_ub_fused.parquet'
    df_not_found_fused.to_parquet(not_found_fused_path, index=True)
    print(f"✅ Not found (fused): {not_found_fused_path}")
    print(f"   Records: {len(df_not_found_fused):,}")

# 5. Comparison Statistics
combined_stats = {
    'original': stats_original,
    'fused': stats_fused,
    'improvements': {
        'isbn_count': int(stats_fused['vdeh_with_isbn'] - stats_original['vdeh_with_isbn']),
        'isbn_matches': int(stats_fused['isbn_matches'] - stats_original['isbn_matches']),
        'total_matches': int(stats_fused['total_matches'] - stats_original['total_matches']),
        'match_rate_improvement': float(stats_fused['match_rate'] - stats_original['match_rate'])
    }
}

stats_path = comparison_dir / 'comparison_statistics.json'
with open(stats_path, 'w', encoding='utf-8') as f:
    json.dump(combined_stats, f, indent=2, ensure_ascii=False)
print(f"\n✅ Statistics: {stats_path}")

# 6. Comparison Summary (CSV)
comparison_csv = comparison_dir / 'vorher_nachher_vergleich.csv'
comparison.to_csv(comparison_csv, index=False, encoding='utf-8-sig')
print(f"✅ Comparison CSV: {comparison_csv}")

print(f"\n{'='*80}")
print(f"✅ BEFORE/AFTER EVALUATION COMPLETE!")
print(f"{'='*80}")
print(f"\n📊 Results summary:")
print(f"   BEFORE:      {stats_original['total_matches']:,} matches ({stats_original['match_rate']:.1f}%)")
print(f"   AFTER:       {stats_fused['total_matches']:,} matches ({stats_fused['match_rate']:.1f}%)")
print(f"   IMPROVEMENT: {total_match_improvement:+,} matches ({match_rate_improvement:+.1f} percentage points)")
print(f"\n   Results saved to: {comparison_dir}")


💾 SAVING RESULTS

✅ Original Matches: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_ub_matches_original.parquet
   Records: 4,428
   CSV: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_ub_matches_original.csv

✅ Fused Matches: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_ub_matches_fused.parquet
   Records: 6,773
   CSV: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_ub_matches_fused.csv

✅ Not found (original): /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_not_in_ub_original.parquet
   Records: 53,877
✅ Not found (fused): /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vdeh_not_in_ub_fused.parquet
   Records: 51,532

✅ Statistics: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/comparison_statistics.json
✅ Comparison CSV: /media/sz/Data/Bibo/analysis/notebooks/03_comparison/results/vorher_nachher_vergleich.csv

✅ BEFORE/AFTER 