## üìä Data Loading and Initial Setup

This section handles the loading of the main dataset and performs initial data exploration to understand the structure and content of our romance books data.

### What this section does:
- Loads the main final dataset from CSV file
- Drops unnecessary columns to focus on core variables
- Performs detailed column-by-column analysis
- Identifies data types, missing values, and unique value counts
- Provides sample data for initial inspection

---

In [1]:
import pandas as pd
import numpy as np
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

### üîÑ Dataset Loading and Column Management

# Load dataset
main_final_path = "../../data/processed/romance_books_main_final.csv"
main_final = pd.read_csv(main_final_path)
logger.info(f"Loaded main final dataset: {len(main_final)} books")

# Drop specified columns
columns_to_drop = ['series_works_count', 'popular_shelves', 'genres', 'decade', 
                   'book_length_category', 'rating_category', 'popularity_category', 
                   'has_collection_indicators']
main_final = main_final.drop(columns=columns_to_drop, errors='ignore')
logger.info(f"Dropped columns: {[col for col in columns_to_drop if col in main_final.columns]}")

# Clean series_works_count_numeric: replace NaN with 'stand_alone'
main_final['series_works_count_numeric'] = main_final['series_works_count_numeric'].fillna('stand_alone')
logger.info(f"Replaced NaN values in series_works_count_numeric with 'stand_alone'")

### üìã Basic Dataset Information

# Display basic info
print(f"Dataset shape after dropping columns: {main_final.shape}")
print(f"\nRemaining column names:")
print(main_final.columns.tolist())
print(f"\nData types:")
print(main_final.dtypes)

### üîç Detailed Column Investigation

# Define ID columns to exclude from numerical analysis
id_columns = ['work_id', 'book_id_list_en', 'author_id', 'series_id']

for col in main_final.columns:
    print(f"\n{'='*60}")
    print(f"COLUMN: {col}")
    print(f"{'='*60}")
    
    # Basic info
    print(f"Data type: {main_final[col].dtype}")
    print(f"Non-null count: {main_final[col].count()} / {len(main_final)} ({main_final[col].count()/len(main_final)*100:.1f}%)")
    print(f"Null count: {main_final[col].isnull().sum()} ({main_final[col].isnull().sum()/len(main_final)*100:.1f}%)")
    
    # Mark ID columns
    if col in id_columns:
        print("üîë ID COLUMN - Excluded from numerical analysis")
    
    # Type-specific analysis
    if main_final[col].dtype in ['object']:
        print(f"Unique values: {main_final[col].nunique()}")
        print(f"Sample values:")
        sample_values = main_final[col].dropna().head(10).tolist()
        for i, val in enumerate(sample_values):
            val_str = str(val)
            if len(val_str) > 100:
                val_str = val_str[:100] + "..."
            print(f"  [{i+1}] {val_str}")
        
        # Check for list-like strings
        if any(main_final[col].dropna().astype(str).str.startswith('[').head(100)):
            print("  ‚ö†Ô∏è  Contains list-like strings - may need parsing")
        
        # Value length distribution for string columns
        lengths = main_final[col].dropna().astype(str).str.len()
        print(f"String length stats: min={lengths.min()}, max={lengths.max()}, mean={lengths.mean():.1f}")
        
    elif main_final[col].dtype in ['int64', 'float64'] and col not in id_columns:
        print(f"üìä NUMERICAL COLUMN - Valid for analysis")
        print(f"Basic stats:")
        stats = main_final[col].describe()
        for stat_name, stat_val in stats.items():
            print(f"  {stat_name}: {stat_val}")
        
        # Check for potential categorical numeric columns
        unique_count = main_final[col].nunique()
        if unique_count <= 20:
            print(f"Value counts (low cardinality - {unique_count} unique values):")
            vc = main_final[col].value_counts().head(10)
            for val, count in vc.items():
                print(f"  {val}: {count} ({count/len(main_final)*100:.1f}%)")
    
    elif main_final[col].dtype in ['int64', 'float64'] and col in id_columns:
        print(f"üîë ID COLUMN - Basic stats skipped")
        unique_count = main_final[col].nunique()
        print(f"Unique values: {unique_count}")
        
    elif main_final[col].dtype in ['bool']:
        print(f"Boolean distribution:")
        vc = main_final[col].value_counts()
        for val, count in vc.items():
            print(f"  {val}: {count} ({count/len(main_final)*100:.1f}%)")

### üìä Sample Data Preview

main_final.head()


INFO:__main__:Loaded main final dataset: 53349 books
INFO:__main__:Dropped columns: []
INFO:__main__:Replaced NaN values in series_works_count_numeric with 'stand_alone'


Dataset shape after dropping columns: (53349, 19)

Remaining column names:
['work_id', 'book_id_list_en', 'title', 'publication_year', 'num_pages_median', 'description', 'language_codes_en', 'author_id', 'author_name', 'author_average_rating', 'author_ratings_count', 'series_id', 'series_title', 'ratings_count_sum', 'text_reviews_count_sum', 'average_rating_weighted_mean', 'genres_str', 'shelves_str', 'series_works_count_numeric']

Data types:
work_id                           int64
book_id_list_en                  object
title                            object
publication_year                  int64
num_pages_median                float64
description                      object
language_codes_en                object
author_id                         int64
author_name                      object
author_average_rating           float64
author_ratings_count              int64
series_id                        object
series_title                     object
ratings_count_sum               

Unnamed: 0,work_id,book_id_list_en,title,publication_year,num_pages_median,description,language_codes_en,author_id,author_name,author_average_rating,author_ratings_count,series_id,series_title,ratings_count_sum,text_reviews_count_sum,average_rating_weighted_mean,genres_str,shelves_str,series_works_count_numeric
0,3237433,"['9416', '227650', '9423', '6088685', '1982627...",Confessions of a Shopaholic,2000,320.0,Unabridged audible download; approximately 11 ...,eng,6160,Sophie Kinsella,3.74,2169284,165735.0,Shopaholic,555675,10488,3.62,"fiction,romance,young adult","3-stars,5-stars,abandoned,adult-fiction,audio,...",12.0
1,1268663,"['3462', '6338758', '289110', '6386960', '1778...",The Rescue,2000,372.0,When confronted by raging fires or deadly acci...,eng,2345,Nicholas Sparks,4.06,4600277,stand_alone,stand_alone,148062,3150,4.1,"fiction,mystery,romance,young adult","2000,2001,2012-reads,adult,adult-fiction,alrea...",stand_alone
2,846763,"['110391', '6077588', '25322247', '1859059', '...",The Duke and I,2000,371.0,Can there be any greater challenge to London's...,eng,63898,Julia Quinn,3.98,567004,153045.0,Bridgertons,61848,2444,4.11,"biography,fiction,historical fiction,history,r...","19th-century,1st-in-series,2012-reads,2016-rea...",19.0
3,3363,"['861326', '6077587', '25322244', '353066', '9...",The Viscount Who Loved Me,2000,381.0,Alternate cover for ISBN: 0380815575/978038081...,eng,63898,Julia Quinn,3.98,567004,144491.0,Bridgertons,38086,1404,4.19,"biography,fiction,historical fiction,history,r...","1,19th-century,2016-reads,3-stars,4-stars,5-st...",19.0
4,2363,"['22649', '22655', '31107', '6560878', '257668...",Bookends,2000,368.0,On the heels of her national bestsellers Jemim...,eng,12915,Jane Green,3.58,502125,stand_alone,stand_alone,34139,842,3.7,"fiction,romance","2002,2003,2004,2005,2006,5-stars,abandoned,adu...",stand_alone


## üîß Universal String Canonicalization

This section performs comprehensive canonicalization of genre and shelf strings to create standardized, normalized versions for consistent analysis and comparison.

### What this section does:
- Applies consistent normalization rules to all genre and shelf strings
- Creates canonical mappings between original and normalized forms
- Handles case normalization, whitespace cleaning, and separator standardization
- Generates comprehensive statistics on transformation patterns
- Prepares clean, standardized data for downstream similarity analysis

In [None]:
# romance-novel-nlp-research/src/eda_analysis/cell3_similarity_embeddings_tuned.py

import os
import time
import json
import logging
from pathlib import Path
from typing import List, Tuple, Dict, Iterable, Optional

import numpy as np
import pandas as pd
import psutil
from joblib import Parallel, delayed
from difflib import SequenceMatcher
import hnswlib

# -----------------------
# Logging
# -----------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler("cell3_pipeline.log", mode="w"), logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_info(msg: str) -> None:
    print(msg); logger.info(msg)

def log_stage(stage: str) -> None:
    sep = "=" * 60
    print(sep); print(f"üöÄ {stage}"); print(sep)
    logger.info(sep); logger.info(f"üöÄ {stage}"); logger.info(sep)

def log_time(label: str, start: float) -> None:
    elapsed = time.time() - start
    print(f"‚è±Ô∏è {label} took {elapsed:.2f} s"); logger.info(f"{label}: {elapsed:.2f}s")

def log_memory() -> None:
    mem_gb = psutil.Process(os.getpid()).memory_info().rss / (1024**3)
    msg = f"üíæ Memory usage: {mem_gb:.2f} GB"
    print(msg); logger.info(msg)

# -----------------------
# Config
# -----------------------
MODEL_NAME: str = "sentence-transformers/all-MiniLM-L6-v2"  # 384-d, fast
USE_MULTILINGUAL: bool = False
BATCH_SIZE: int = 1024
NORMALIZE: bool = True

TOP_K_NEIGHBORS: int = 50
SIMILARITY_THRESHOLD: float = 0.30  # initial; will be overridden if tuning enabled

# HNSW
HNSW_M: int = 16
HNSW_EF_CONSTRUCTION: int = 200
HNSW_EF: int = 200
QUERY_BATCH_SIZE: int = 25000

# ---- Threshold tuning ----
ENABLE_THRESHOLD_TUNING: bool = True
EVAL_CSV_PATH: Optional[str] = None  # CSV with columns: token_a, token_b, label (1/0)
EVAL_MIN_PRECISION: Optional[float] = 0.90  # set None to disable constraint
EVAL_MIN_RECALL: Optional[float] = None     # set None to disable constraint
EVAL_GRID_START: float = 0.05
EVAL_GRID_END: float = 0.95
EVAL_GRID_STEP: float = 0.01

# In-memory eval fallback (used if CSV not provided). Define in the notebook before running:
# eval_positive_pairs = [("shelf a", "shelf a "), ...]
# eval_negative_pairs = [("shelf a", "shelf z"), ...]
# They are optional; if neither CSV nor lists exist, tuning is skipped.

OUTPUTS_DIR = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

# -----------------------
# Inputs (from previous cells)
# -----------------------
try:
    canonical_tokens: List[str] = list(unique_canonical_shelves)  # noqa: F821
except NameError as e:
    raise RuntimeError("Expected `unique_canonical_shelves` to be defined in previous cells.") from e

# -----------------------
# Embedding utils
# -----------------------
def load_embedder(model_name: str):
    """Load Sentence-Transformers model."""
    try:
        from sentence_transformers import SentenceTransformer
    except Exception as exc:
        raise RuntimeError("Install with: pip install -U sentence-transformers") from exc
    device = "cuda" if os.environ.get("USE_CUDA", "1") == "1" else "cpu"
    return SentenceTransformer(model_name, device=device)

def encode_texts(embedder, texts: List[str], batch_size: int, normalize_vecs: bool) -> np.ndarray:
    """Batched encode; returns float32."""
    return embedder.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=normalize_vecs
    ).astype(np.float32, copy=False)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumes L2-normalized rows; returns dot-product cosine."""
    return (a * b).sum(axis=1)

# -----------------------
# HNSW utils
# -----------------------
def build_hnsw(vectors: np.ndarray, m: int, ef_c: int, ef_q: int, topk: int):
    index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
    index.init_index(max_elements=vectors.shape[0], ef_construction=ef_c, M=m)
    index.set_ef(max(ef_q, topk))
    index.add_items(vectors, np.arange(vectors.shape[0], dtype=np.int32))
    return index

def knn_query_batched(index, vectors: np.ndarray, k: int, batch: int) -> Tuple[np.ndarray, np.ndarray]:
    n = vectors.shape[0]
    all_idx = np.empty((n, k), dtype=np.int32)
    all_dist = np.empty((n, k), dtype=np.float32)
    s = 0
    while s < n:
        e = min(s + batch, n)
        idx, dist = index.knn_query(vectors[s:e], k=k)
        all_idx[s:e] = idx
        all_dist[s:e] = dist
        s = e
    return all_idx, all_dist

# -----------------------
# Eval / threshold tuning
# -----------------------
def load_eval_pairs_from_csv(path: str) -> Tuple[List[Tuple[str,str]], List[Tuple[str,str]]]:
    df = pd.read_csv(path)
    # Flexible column names
    cols = {c.lower(): c for c in df.columns}
    a = cols.get("token_a", cols.get("a", None))
    b = cols.get("token_b", cols.get("b", None))
    y = cols.get("label", cols.get("y", None))
    if not (a and b and y):
        raise ValueError("CSV must have columns token_a, token_b, label (1/0).")
    pos = [(x, y_) for x, y_, lbl in zip(df[a], df[b], df[y]) if int(lbl) == 1]
    neg = [(x, y_) for x, y_, lbl in zip(df[a], df[b], df[y]) if int(lbl) == 0]
    return pos, neg

def gather_eval_pairs_from_namespace() -> Tuple[List[Tuple[str,str]], List[Tuple[str,str]]]:
    """Pick up eval pairs if user defined `eval_positive_pairs` / `eval_negative_pairs` in notebook."""
    pos, neg = [], []
    g = globals()
    if "eval_positive_pairs" in g and isinstance(g["eval_positive_pairs"], Iterable):
        pos = [(str(a), str(b)) for a, b in g["eval_positive_pairs"]]
    if "eval_negative_pairs" in g and isinstance(g["eval_negative_pairs"], Iterable):
        neg = [(str(a), str(b)) for a, b in g["eval_negative_pairs"]]
    return pos, neg

def embed_eval_pairs(embedder, pairs_pos: List[Tuple[str,str]], pairs_neg: List[Tuple[str,str]], batch_size: int, normalize_vecs: bool) -> Tuple[np.ndarray, np.ndarray]:
    """Embed unique strings once; return (scores, labels)."""
    all_pairs = pairs_pos + pairs_neg
    if not all_pairs:
        return np.array([]), np.array([])
    uniq: Dict[str, int] = {}
    strings: List[str] = []
    for a, b in all_pairs:
        if a not in uniq:
            uniq[a] = len(strings); strings.append(a)
        if b not in uniq:
            uniq[b] = len(strings); strings.append(b)
    mat = encode_texts(embedder, strings, batch_size, normalize_vecs)  # normalized
    scores = np.empty(len(all_pairs), dtype=np.float32)
    labels = np.empty(len(all_pairs), dtype=np.int32)
    for i, (a, b) in enumerate(all_pairs):
        va = mat[uniq[a]][None, :]
        vb = mat[uniq[b]][None, :]
        scores[i] = cosine_sim(va, vb)[0]
        labels[i] = 1 if i < len(pairs_pos) else 0
    return scores, labels

def precision_recall_f1(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[float, float, float]:
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1

def sweep_thresholds(scores: np.ndarray, labels: np.ndarray, start: float, end: float, step: float) -> pd.DataFrame:
    """Return metrics across thresholds."""
    if scores.size == 0:
        return pd.DataFrame(columns=["threshold","precision","recall","f1","support_pos","support_neg"])
    thrs = np.arange(start, end + 1e-9, step, dtype=np.float32)
    rows = []
    pos_cnt = int((labels == 1).sum())
    neg_cnt = int((labels == 0).sum())
    for t in thrs:
        y_pred = (scores >= t).astype(np.int32)
        p, r, f1 = precision_recall_f1(labels, y_pred)
        rows.append((float(t), p, r, f1, pos_cnt, neg_cnt))
    return pd.DataFrame(rows, columns=["threshold","precision","recall","f1","support_pos","support_neg"])

def pick_threshold(metrics: pd.DataFrame,
                   min_precision: Optional[float],
                   min_recall: Optional[float]) -> Dict[str, float]:
    """Return chosen thresholds based on F1 and constraints."""
    if metrics.empty:
        return {"best_f1": None, "best_at_min_precision": None, "best_at_min_recall": None}
    # Best F1 (global)
    best_f1_row = metrics.iloc[metrics["f1"].values.argmax()]
    best_f1 = float(best_f1_row["threshold"])
    # Best under min precision
    best_at_min_p = None
    if isinstance(min_precision, float):
        sub = metrics[metrics["precision"] >= min_precision]
        if not sub.empty:
            best_at_min_p = float(sub.iloc[sub["f1"].values.argmax()]["threshold"])
    # Best under min recall
    best_at_min_r = None
    if isinstance(min_recall, float):
        sub = metrics[metrics["recall"] >= min_recall]
        if not sub.empty:
            best_at_min_r = float(sub.iloc[sub["f1"].values.argmax()]["threshold"])
    return {"best_f1": best_f1, "best_at_min_precision": best_at_min_p, "best_at_min_recall": best_at_min_r}

# -----------------------
# Pair generation on index
# -----------------------
def process_row(i: int, neighbors_idx: np.ndarray, neighbors_dist: np.ndarray, tokens: List[str], thr: float) -> List[dict]:
    res = []
    dists = neighbors_dist[i][1:]
    idxs = neighbors_idx[i][1:]
    for rank, (d, j) in enumerate(zip(dists, idxs), start=1):
        sim = 1.0 - float(d)
        if sim >= thr:
            res.append({"token_a": tokens[i], "token_b": tokens[int(j)], "cosine_sim": sim, "rank": rank})
    return res

def extra_metrics(pairs: List[dict]) -> None:
    for p in pairs:
        ratio = SequenceMatcher(None, p["token_a"], p["token_b"], autojunk=False).ratio()
        p["seq_ratio"] = round(ratio, 4)
        p["len_a"] = len(p["token_a"])
        p["len_b"] = len(p["token_b"])
        p["len_diff"] = abs(p["len_a"] - p["len_b"])
        p["df_ratio"] = 1.0

# -----------------------
# Main
# -----------------------
def main() -> None:
    # Model
    model = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" if USE_MULTILINGUAL else MODEL_NAME
    log_info("üìã CONFIGURATION (Embeddings + Tuning)")
    log_info(f"  Model: {model}")
    log_info(f"  Batch size: {BATCH_SIZE}  | Normalize: {NORMALIZE}")
    log_info(f"  HNSW: M={HNSW_M}, efC={HNSW_EF_CONSTRUCTION}, efQ={HNSW_EF}")
    log_info(f"  Top-K neighbors: {TOP_K_NEIGHBORS}")
    log_info(f"  Tokens: {len(canonical_tokens):,}")
    log_info(f"  Tuning: {ENABLE_THRESHOLD_TUNING} | CSV: {EVAL_CSV_PATH or 'None'} | minP={EVAL_MIN_PRECISION} | minR={EVAL_MIN_RECALL}")
    log_memory()

    # Load model
    log_stage("LOADING EMBEDDING MODEL")
    t0 = time.time()
    embedder = load_embedder(model)
    dim = embedder.get_sentence_embedding_dimension()
    log_info(f"‚úÖ Model loaded. Embedding dim: {dim}")
    log_time("Model load", t0)

    # Optional: threshold tuning
    tuned_threshold = None
    eval_metrics_path = OUTPUTS_DIR / "similarity_threshold_metrics.csv"
    eval_summary_path = OUTPUTS_DIR / "similarity_threshold_summary.json"
    if ENABLE_THRESHOLD_TUNING:
        log_stage("THRESHOLD TUNING (QUICK EVAL)")
        # Load eval pairs
        pos_pairs: List[Tuple[str,str]] = []
        neg_pairs: List[Tuple[str,str]] = []
        if EVAL_CSV_PATH:
            pos_pairs, neg_pairs = load_eval_pairs_from_csv(EVAL_CSV_PATH)
        else:
            pp, nn = gather_eval_pairs_from_namespace()
            pos_pairs, neg_pairs = pp, nn

        if (not pos_pairs) or (not neg_pairs):
            log_info("‚ö†Ô∏è  No eval pairs found. Skipping tuning and keeping configured threshold.")
        else:
            t_eval = time.time()
            scores, labels = embed_eval_pairs(embedder, pos_pairs, neg_pairs, BATCH_SIZE, NORMALIZE)
            metrics_df = sweep_thresholds(scores, labels, EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP)
            choices = pick_threshold(metrics_df, EVAL_MIN_PRECISION, EVAL_MIN_RECALL)

            # Choose priority: min-precision -> min-recall -> best F1
            tuned_threshold = (
                choices["best_at_min_precision"]
                or choices["best_at_min_recall"]
                or choices["best_f1"]
            )
            metrics_df.to_csv(eval_metrics_path, index=False)
            with open(eval_summary_path, "w", encoding="utf-8") as f:
                json.dump(
                    {
                        "grid": [EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP],
                        "min_precision": EVAL_MIN_PRECISION,
                        "min_recall": EVAL_MIN_RECALL,
                        "choices": choices,
                        "picked_threshold": tuned_threshold,
                        "pos_pairs": len(pos_pairs),
                        "neg_pairs": len(neg_pairs)
                    },
                    f,
                    indent=2
                )
            log_time("Tuning", t_eval)
            log_info(f"‚úÖ Tuned threshold = {tuned_threshold:.3f}  (choices={choices})")

    # Embeddings for full token set
    log_stage("EMBEDDING CANONICAL TOKENS")
    t1 = time.time()
    X = encode_texts(embedder, canonical_tokens, BATCH_SIZE, NORMALIZE)
    log_time("Embedding", t1)
    log_info(f"‚úÖ Embeddings: {X.shape[0]:,} √ó {X.shape[1]:,} (float32)")
    log_memory()

    # Build & query ANN
    log_stage("BUILDING ANN INDEX & RETRIEVING NEIGHBORS")
    k = min(TOP_K_NEIGHBORS + 1, X.shape[0])
    t2 = time.time()
    index = build_hnsw(X, HNSW_M, HNSW_EF_CONSTRUCTION, HNSW_EF, k)
    idx, dist = knn_query_batched(index, X, k=k, batch=QUERY_BATCH_SIZE)
    log_time("HNSW build + KNN", t2)
    log_info("‚úÖ ANN completed")
    log_memory()

    # Use tuned threshold if available
    threshold = float(tuned_threshold) if tuned_threshold is not None else float(SIMILARITY_THRESHOLD)
    log_info(f"üîé Using similarity threshold: {threshold:.3f}")

    # Generate candidate pairs
    log_stage("GENERATING CANDIDATE PAIRS")
    t3 = time.time()
    nested = Parallel(n_jobs=-1, verbose=1)(
        delayed(process_row)(i, idx, dist, canonical_tokens, threshold) for i in range(X.shape[0])
    )
    pairs = [p for sub in nested for p in sub]
    log_time("Candidate pair generation", t3)
    log_info(f"üìä Candidate pairs: {len(pairs):,}")
    log_memory()

    # Extra diagnostics
    log_stage("CALCULATING ADDITIONAL METRICS")
    extra_metrics(pairs)
    log_info(f"‚úÖ Metrics enriched: {len(pairs):,}")

    # Save outputs
    log_stage("SAVING RESULTS")
    pairs_df = pd.DataFrame(pairs)
    pairs_path = OUTPUTS_DIR / "candidate_similarity_pairs.parquet"
    sample_path = OUTPUTS_DIR / "similarity_sample_inspection.parquet"
    meta_path = OUTPUTS_DIR / "cell3_similarity_metadata.json"

    pairs_df.to_parquet(pairs_path, index=False)
    sample_n = min(5000, len(pairs))
    if sample_n > 0:
        pairs_df.sample(n=sample_n, random_state=42).to_parquet(sample_path, index=False)

    meta = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "config": {
            "model": model,
            "batch_size": BATCH_SIZE,
            "normalize": NORMALIZE,
            "top_k_neighbors": TOP_K_NEIGHBORS,
            "similarity_threshold_initial": SIMILARITY_THRESHOLD,
            "similarity_threshold_used": threshold,
            "hnsw": {"M": HNSW_M, "ef_construction": HNSW_EF_CONSTRUCTION, "ef": HNSW_EF},
            "query_batch_size": QUERY_BATCH_SIZE,
            "tuning": {
                "enabled": ENABLE_THRESHOLD_TUNING,
                "csv": EVAL_CSV_PATH,
                "min_precision": EVAL_MIN_PRECISION,
                "min_recall": EVAL_MIN_RECALL,
                "grid": [EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP],
                "metrics_csv": str(eval_metrics_path) if ENABLE_THRESHOLD_TUNING else None,
                "summary_json": str(eval_summary_path) if ENABLE_THRESHOLD_TUNING else None
            }
        },
        "stats": {
            "total_tokens": len(canonical_tokens),
            "embedding_dim": int(X.shape[1]),
            "candidate_pairs_found": len(pairs)
        },
        "outputs": {
            "candidate_pairs_file": str(pairs_path),
            "sample_file": str(sample_path) if sample_n > 0 else None
        }
    }
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2, ensure_ascii=False)
    log_info(f"‚úÖ Saved pairs: {pairs_path}")
    if ENABLE_THRESHOLD_TUNING:
        log_info(f"‚úÖ Saved tuning metrics: {eval_metrics_path}")
        log_info(f"‚úÖ Saved tuning summary: {eval_summary_path}")
    log_info(f"‚úÖ Metadata: {meta_path}")

    # Summary
    log_stage("SUMMARY STATISTICS")
    if pairs:
        sims = [p["cosine_sim"] for p in pairs]
        ratios = [p["seq_ratio"] for p in pairs]
        log_info(f"Cosine similarity: min={min(sims):.3f}, mean={np.mean(sims):.3f}, max={max(sims):.3f}")
        log_info(f"SeqMatcher ratio: min={min(ratios):.3f}, mean={np.mean(ratios):.3f}, max={max(ratios):.3f}")
        covered = {p["token_a"] for p in pairs} | {p["token_b"] for p in pairs}
        cov = len(covered) / len(canonical_tokens) * 100
        log_info(f"Token coverage: {len(covered):,}/{len(canonical_tokens):,} ({cov:.1f}%)")
    log_info("üéâ Cell 3 completed with pretrained embeddings + auto threshold tuning!")

if __name__ == "__main__":
    main()

INFO:__main__:üìã CONFIGURATION (Embeddings + Tuning)
INFO:__main__:  Model: sentence-transformers/all-MiniLM-L6-v2
INFO:__main__:  Batch size: 1024  | Normalize: True
INFO:__main__:  HNSW: M=16, efC=200, efQ=200
INFO:__main__:  Top-K neighbors: 50
INFO:__main__:  Tokens: 254,778
INFO:__main__:  Tuning: True | CSV: None | minP=0.9 | minR=None
INFO:__main__:üíæ Memory usage: 3.90 GB
INFO:__main__:üöÄ LOADING EMBEDDING MODEL


üìã CONFIGURATION (Embeddings + Tuning)
  Model: sentence-transformers/all-MiniLM-L6-v2
  Batch size: 1024  | Normalize: True
  HNSW: M=16, efC=200, efQ=200
  Top-K neighbors: 50
  Tokens: 254,778
  Tuning: True | CSV: None | minP=0.9 | minR=None
üíæ Memory usage: 3.90 GB
üöÄ LOADING EMBEDDING MODEL


INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

INFO:__main__:‚úÖ Model loaded. Embedding dim: 384
INFO:__main__:Model load: 22.55s
INFO:__main__:üöÄ THRESHOLD TUNING (QUICK EVAL)
INFO:__main__:‚ö†Ô∏è  No eval pairs found. Skipping tuning and keeping configured threshold.
INFO:__main__:üöÄ EMBEDDING CANONICAL TOKENS


‚úÖ Model loaded. Embedding dim: 384
‚è±Ô∏è Model load took 22.55 s
üöÄ THRESHOLD TUNING (QUICK EVAL)
‚ö†Ô∏è  No eval pairs found. Skipping tuning and keeping configured threshold.
üöÄ EMBEDDING CANONICAL TOKENS


Batches:   0%|          | 0/249 [00:00<?, ?it/s]

INFO:__main__:Embedding: 70.72s
INFO:__main__:‚úÖ Embeddings: 254,778 √ó 384 (float32)
INFO:__main__:üíæ Memory usage: 5.61 GB
INFO:__main__:üöÄ BUILDING ANN INDEX & RETRIEVING NEIGHBORS


‚è±Ô∏è Embedding took 70.72 s
‚úÖ Embeddings: 254,778 √ó 384 (float32)
üíæ Memory usage: 5.61 GB
üöÄ BUILDING ANN INDEX & RETRIEVING NEIGHBORS


INFO:__main__:HNSW build + KNN: 160.52s
INFO:__main__:‚úÖ ANN completed
INFO:__main__:üíæ Memory usage: 6.12 GB
INFO:__main__:üîé Using similarity threshold: 0.300
INFO:__main__:üöÄ GENERATING CANDIDATE PAIRS


‚è±Ô∏è HNSW build + KNN took 160.52 s
‚úÖ ANN completed
üíæ Memory usage: 6.12 GB
üîé Using similarity threshold: 0.300
üöÄ GENERATING CANDIDATE PAIRS


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    3.6s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   18.6s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:   43.3s
[Parallel(n_jobs=-1)]: Done 776 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 1226 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 1776 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 2426 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 3176 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done 4026 tasks      | elapsed:  6.8min


In [2]:
# =============================================================================
# CELL 2: UNIVERSAL STRING CANONICALIZATION (v0)
# =============================================================================

import re
import time
from collections import defaultdict, Counter
import json
import os
from pathlib import Path

print(f"\n[{time.strftime('%H:%M:%S')}] üîß CELL 2: UNIVERSAL STRING CANONICALIZATION (v0)")
print("=" * 70)

### ‚öôÔ∏è Configuration Setup

# Configuration for canonicalization
CANONICAL_CONFIG = {
    'normalize_case': True,
    'remove_extra_whitespace': True,
    'remove_special_chars': False,  # Keep for genre/shelf analysis
    'standardize_separators': True,
    'min_token_length': 1,
    'max_token_length': 100
}

print(f"üìã CANONICALIZATION CONFIG:")
for key, value in CANONICAL_CONFIG.items():
    print(f"  {key}: {value}")

### üìö Genre Canonicalization

print(f"\nüìö GENRE CANONICALIZATION")
print("-" * 40)

def canonicalize_genre(genre_str):
    """
    Canonicalize a single genre string.
    
    Args:
        genre_str (str): Raw genre string
        
    Returns:
        str: Canonicalized genre string
    """
    if not isinstance(genre_str, str) or not genre_str.strip():
        return ""
    
    # Normalize case
    canonical = genre_str.lower() if CANONICAL_CONFIG['normalize_case'] else genre_str
    
    # Remove extra whitespace
    if CANONICAL_CONFIG['remove_extra_whitespace']:
        canonical = ' '.join(canonical.split())
    
    # Standardize separators (hyphens to spaces for consistency)
    if CANONICAL_CONFIG['standardize_separators']:
        canonical = re.sub(r'[-_]+', ' ', canonical)
        canonical = ' '.join(canonical.split())  # Clean up multiple spaces
    
    # Length validation
    if len(canonical) < CANONICAL_CONFIG['min_token_length'] or len(canonical) > CANONICAL_CONFIG['max_token_length']:
        return ""
    
    return canonical.strip()

# Apply canonicalization to unique genres
print(f"\nüîß PROCESSING GENRES:")
print(f"Canonicalizing genres from main_final dataset...")

# Extract unique genres from the dataset
unique_genres = set()
for idx, row in main_final.iterrows():
    if pd.notna(row.get('genres_str')) and row['genres_str'].strip():
        genres_list = [g.strip() for g in row['genres_str'].split(',') if g.strip()]
        unique_genres.update(genres_list)

print(f"Found {len(unique_genres):,} unique genres")

canonical_genres = {}
genre_mapping_stats = defaultdict(list)

for original_genre in unique_genres:
    canonical = canonicalize_genre(original_genre)
    canonical_genres[original_genre] = canonical
    
    # Track mapping for analysis
    if canonical != original_genre.lower():
        genre_mapping_stats['changed'].append((original_genre, canonical))
    else:
        genre_mapping_stats['unchanged'].append(original_genre)

print(f"  ‚úÖ Processed {len(canonical_genres):,} genres")

### üìö Shelf Canonicalization

print(f"\nüìö SHELF CANONICALIZATION")
print("-" * 40)

def canonicalize_shelf(shelf_str):
    """
    Canonicalize a single shelf string.
    
    Args:
        shelf_str (str): Raw shelf string
        
    Returns:
        str: Canonicalized shelf string
    """
    if not isinstance(shelf_str, str) or not shelf_str.strip():
        return ""
    
    # Normalize case
    canonical = shelf_str.lower() if CANONICAL_CONFIG['normalize_case'] else shelf_str
    
    # Remove extra whitespace
    if CANONICAL_CONFIG['remove_extra_whitespace']:
        canonical = ' '.join(canonical.split())
    
    # Standardize separators (hyphens to spaces for consistency)
    if CANONICAL_CONFIG['standardize_separators']:
        canonical = re.sub(r'[-_]+', ' ', canonical)
        canonical = ' '.join(canonical.split())  # Clean up multiple spaces
    
    # Length validation
    if len(canonical) < CANONICAL_CONFIG['min_token_length'] or len(canonical) > CANONICAL_CONFIG['max_token_length']:
        return ""
    
    return canonical.strip()

# Apply canonicalization to unique shelves
print(f"\nüîß PROCESSING SHELVES:")
print(f"Canonicalizing shelves from main_final dataset...")

# Extract unique shelves from the dataset
unique_shelves = set()
for idx, row in main_final.iterrows():
    if pd.notna(row.get('shelves_str')) and row['shelves_str'].strip():
        shelves_list = [s.strip() for s in row['shelves_str'].split(',') if s.strip()]
        unique_shelves.update(shelves_list)

print(f"Found {len(unique_shelves):,} unique shelves")

canonical_shelves = {}
shelf_mapping_stats = defaultdict(list)

for original_shelf in unique_shelves:
    canonical = canonicalize_shelf(original_shelf)
    canonical_shelves[original_shelf] = canonical
    
    # Track mapping for analysis
    if canonical != original_shelf.lower():
        shelf_mapping_stats['changed'].append((original_shelf, canonical))
    else:
        shelf_mapping_stats['unchanged'].append(original_shelf)

print(f"  ‚úÖ Processed {len(canonical_shelves):,} shelves")

### üìä Canonicalization Results

print(f"\nüìä CANONICALIZATION RESULTS:")
print("-" * 40)

# Get unique canonical values
unique_canonical_genres = set(canonical_genres.values())
unique_canonical_shelves = set(canonical_shelves.values())

# Calculate compression ratios
genre_compression_ratio = len(unique_canonical_genres) / len(unique_genres) if len(unique_genres) > 0 else 0
shelf_compression_ratio = len(unique_canonical_shelves) / len(unique_shelves) if len(unique_shelves) > 0 else 0

print(f"üìö GENRES:")
print(f"  Original count: {len(unique_genres):,}")
print(f"  Canonical count: {len(unique_canonical_genres):,}")
print(f"  Compression ratio: {genre_compression_ratio:.3f}")
print(f"  Changes made: {len(genre_mapping_stats['changed']):,}")
print(f"  Unchanged: {len(genre_mapping_stats['unchanged']):,}")

print(f"\nüìö SHELVES:")
print(f"  Original count: {len(unique_shelves):,}")
print(f"  Canonical count: {len(unique_canonical_shelves):,}")
print(f"  Compression ratio: {shelf_compression_ratio:.3f}")
print(f"  Changes made: {len(shelf_mapping_stats['changed']):,}")
print(f"  Unchanged: {len(shelf_mapping_stats['unchanged']):,}")

### üíæ Save Canonical Mappings

print(f"\nüíæ SAVING CANONICAL MAPPINGS:")
print("-" * 40)

# Create outputs directory
outputs_dir = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
outputs_dir.mkdir(parents=True, exist_ok=True)

# Save genre mappings
genre_mappings_df = pd.DataFrame([
    {'original': orig, 'canonical': canon} 
    for orig, canon in canonical_genres.items()
])
genre_mappings_path = outputs_dir / "genre_canonical_mappings.csv"
genre_mappings_df.to_csv(genre_mappings_path, index=False)
print(f"  ‚úÖ Saved genre mappings to: {genre_mappings_path}")

# Save shelf mappings
shelf_mappings_df = pd.DataFrame([
    {'original': orig, 'canonical': canon} 
    for orig, canon in canonical_shelves.items()
])
shelf_mappings_path = outputs_dir / "shelf_canonical_mappings.csv"
shelf_mappings_df.to_csv(shelf_mappings_path, index=False)
print(f"  ‚úÖ Saved shelf mappings to: {shelf_mappings_path}")

# Save canonicalization metadata
canonical_meta = {
    'timestamp': time.strftime('%Y-%m-%d %H:%M:%S'),
    'config': CANONICAL_CONFIG,
    'stats': {
        'genres': {
            'original_count': len(unique_genres),
            'canonical_count': len(unique_canonical_genres),
            'compression_ratio': genre_compression_ratio,
            'changes_count': len(genre_mapping_stats['changed']),
            'duplicates_eliminated': len(unique_genres) - len(unique_canonical_genres)
        },
        'shelves': {
            'original_count': len(unique_shelves),
            'canonical_count': len(unique_canonical_shelves),
            'compression_ratio': shelf_compression_ratio,
            'changes_count': len(shelf_mapping_stats['changed']),
            'duplicates_eliminated': len(unique_shelves) - len(unique_canonical_shelves)
        }
    }
}

metadata_path = outputs_dir / "canonicalization_metadata.json"
with open(metadata_path, 'w', encoding='utf-8') as f:
    json.dump(canonical_meta, f, indent=2, ensure_ascii=False)
print(f"  ‚úÖ Saved metadata to: {metadata_path}")

print(f"\n[{time.strftime('%H:%M:%S')}] ‚úÖ Cell 2: Universal String Canonicalization completed successfully!")



[22:01:11] üîß CELL 2: UNIVERSAL STRING CANONICALIZATION (v0)
üìã CANONICALIZATION CONFIG:
  normalize_case: True
  remove_extra_whitespace: True
  remove_special_chars: False
  standardize_separators: True
  min_token_length: 1
  max_token_length: 100

üìö GENRE CANONICALIZATION
----------------------------------------

üîß PROCESSING GENRES:
Canonicalizing genres from main_final dataset...
Found 13 unique genres
  ‚úÖ Processed 13 genres

üìö SHELF CANONICALIZATION
----------------------------------------

üîß PROCESSING SHELVES:
Canonicalizing shelves from main_final dataset...
Found 255,664 unique shelves
  ‚úÖ Processed 255,664 shelves

üìä CANONICALIZATION RESULTS:
----------------------------------------
üìö GENRES:
  Original count: 13
  Canonical count: 13
  Compression ratio: 1.000
  Changes made: 0
  Unchanged: 13

üìö SHELVES:
  Original count: 255,664
  Canonical count: 254,778
  Compression ratio: 0.997
  Changes made: 230,934
  Unchanged: 24,730

üíæ SAVING CA

## üîç Character Similarity Index & Neighbor Retrieval

This section builds a comprehensive character-based similarity index using TF-IDF vectorization and approximate nearest neighbor (ANN) search to identify potential duplicate shelf names.

### What this section does:
- Creates TF-IDF vectors from canonical shelf tokens using character n-grams
- Builds an approximate nearest neighbor index for efficient similarity search
- Retrieves candidate similar pairs based on cosine similarity thresholds
- Generates comprehensive statistics on similarity patterns and coverage
- Exports sample data for manual validation and quality assessment

In [None]:
! pip install hnswlib
! pip install -U sentence-transformers

In [9]:
# romance-novel-nlp-research/src/eda_analysis/cell3_similarity_embeddings_tuned.py

import os
import re
import time
import json
import random
import logging
from pathlib import Path
from typing import List, Tuple, Dict, Iterable, Optional

import numpy as np
import pandas as pd
import psutil
import hnswlib

# ======================
# Logging
# ======================
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.FileHandler("cell3_pipeline.log", mode="w"), logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_info(msg: str) -> None:
    print(msg)
    logger.info(msg)

def log_stage(stage: str) -> None:
    sep = "=" * 60
    print(sep); print(f"üöÄ {stage}"); print(sep)
    logger.info(sep); logger.info(f"üöÄ {stage}"); logger.info(sep)

def log_time(label: str, start: float) -> None:
    elapsed = time.time() - start
    print(f"‚è±Ô∏è {label} took {elapsed:.2f} s")
    logger.info(f"{label}: {elapsed:.2f}s")

def log_memory() -> None:
    mem_gb = psutil.Process(os.getpid()).memory_info().rss / (1024**3)
    msg = f"üíæ Memory usage: {mem_gb:.2f} GB"
    print(msg); logger.info(msg)

# ======================
# Config
# ======================
MODEL_NAME: str = "sentence-transformers/all-MiniLM-L6-v2"  # 384-d, fast
USE_MULTILINGUAL: bool = False                               # True ‚Üí paraphrase-multilingual-MiniLM-L12-v2
BATCH_SIZE: int = 1024
NORMALIZE: bool = True

TOP_K_NEIGHBORS: int = 50
SIMILARITY_THRESHOLD: float = 0.30  # initial; may be overridden by tuning

# HNSW
HNSW_M: int = 16
HNSW_EF_CONSTRUCTION: int = 200
HNSW_EF: int = 200
QUERY_BATCH_SIZE: int = 25000

# Threshold tuning
ENABLE_THRESHOLD_TUNING: bool = True
EVAL_CSV_PATH: Optional[str] = None   # CSV columns: token_a, token_b, label (1/0)
EVAL_MIN_PRECISION: Optional[float] = 0.90
EVAL_MIN_RECALL: Optional[float] = None
EVAL_GRID_START: float = 0.05
EVAL_GRID_END: float = 0.95
EVAL_GRID_STEP: float = 0.01

# Auto-generate eval pairs if none are supplied
AUTOGEN_EVAL_IF_MISSING: bool = True
AUTOGEN_TARGET_POS: int = 1500
AUTOGEN_TARGET_NEG: int = 1500
AUTOGEN_MAX_PAIRS_PER_GROUP: int = 3
AUTOGEN_BUCKET_PREFIX_LEN: int = 2
AUTOGEN_POS_MIN_RATIO: int = 90    # %
AUTOGEN_NEG_MAX_RATIO: int = 20    # %
AUTOGEN_RANDOM_SEED: int = 42
AUTOGEN_MAX_NEG_ITERS: int = 1_000_000

# Optional extra string metrics on the final pairs
ADD_STRING_METRICS: bool = False          # default off for speed
USE_RAPIDFUZZ: bool = True                # prefer RapidFuzz if available
METRICS_SAMPLE_N: Optional[int] = None    # e.g., 200_000 to cap; None = all

OUTPUTS_DIR = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

# ======================
# Inputs (from previous cells)
# ======================
try:
    canonical_tokens: List[str] = list(unique_canonical_shelves)  # noqa: F821
except NameError as e:
    raise RuntimeError("Expected `unique_canonical_shelves` to be defined in previous cells.") from e

# ======================
# Embedding utils
# ======================
def load_embedder(model_name: str):
    try:
        from sentence_transformers import SentenceTransformer
    except Exception as exc:
        raise RuntimeError("Install with: pip install -U sentence-transformers") from exc
    device = "cuda" if os.environ.get("USE_CUDA", "1") == "1" else "cpu"
    return SentenceTransformer(model_name, device=device)

def encode_texts(embedder, texts: List[str], batch_size: int, normalize_vecs: bool) -> np.ndarray:
    return embedder.encode(
        texts,
        batch_size=batch_size,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=normalize_vecs
    ).astype(np.float32, copy=False)

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a * b).sum(axis=1)

# ======================
# HNSW utils
# ======================
def build_hnsw(vectors: np.ndarray, m: int, ef_c: int, ef_q: int, topk: int):
    index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
    index.init_index(max_elements=vectors.shape[0], ef_construction=ef_c, M=m)
    index.set_ef(max(ef_q, topk))
    index.add_items(vectors, np.arange(vectors.shape[0], dtype=np.int32))
    return index

def knn_query_batched(index, vectors: np.ndarray, k: int, batch: int) -> Tuple[np.ndarray, np.ndarray]:
    n = vectors.shape[0]
    all_idx = np.empty((n, k), dtype=np.int32)
    all_dist = np.empty((n, k), dtype=np.float32)
    s = 0
    while s < n:
        e = min(s + batch, n)
        idx, dist = index.knn_query(vectors[s:e], k=k)
        all_idx[s:e] = idx
        all_dist[s:e] = dist
        s = e
    return all_idx, all_dist

# ======================
# Eval / threshold tuning
# ======================
def load_eval_pairs_from_csv(path: str) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    df = pd.read_csv(path)
    cols = {c.lower(): c for c in df.columns}
    a = cols.get("token_a", cols.get("a"))
    b = cols.get("token_b", cols.get("b"))
    y = cols.get("label", cols.get("y"))
    if not (a and b and y):
        raise ValueError("CSV must have columns token_a, token_b, label (1/0).")
    pos = [(str(x), str(y_)) for x, y_, lbl in zip(df[a], df[b], df[y]) if int(lbl) == 1]
    neg = [(str(x), str(y_)) for x, y_, lbl in zip(df[a], df[b], df[y]) if int(lbl) == 0]
    return pos, neg

def gather_eval_pairs_from_namespace() -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    pos, neg = [], []
    g = globals()
    if "eval_positive_pairs" in g and isinstance(g["eval_positive_pairs"], Iterable):
        pos = [(str(a), str(b)) for a, b in g["eval_positive_pairs"]]
    if "eval_negative_pairs" in g and isinstance(g["eval_negative_pairs"], Iterable):
        neg = [(str(a), str(b)) for a, b in g["eval_negative_pairs"]]
    return pos, neg

def embed_eval_pairs(embedder, pairs_pos: List[Tuple[str, str]], pairs_neg: List[Tuple[str, str]],
                     batch_size: int, normalize_vecs: bool) -> Tuple[np.ndarray, np.ndarray]:
    all_pairs = pairs_pos + pairs_neg
    if not all_pairs:
        return np.array([]), np.array([])
    uniq: Dict[str, int] = {}
    strings: List[str] = []
    for a, b in all_pairs:
        if a not in uniq:
            uniq[a] = len(strings); strings.append(a)
        if b not in uniq:
            uniq[b] = len(strings); strings.append(b)
    mat = encode_texts(embedder, strings, batch_size, normalize_vecs)
    scores = np.empty(len(all_pairs), dtype=np.float32)
    labels = np.empty(len(all_pairs), dtype=np.int32)
    for i, (a, b) in enumerate(all_pairs):
        va = mat[uniq[a]][None, :]
        vb = mat[uniq[b]][None, :]
        scores[i] = cosine_sim(va, vb)[0]
        labels[i] = 1 if i < len(pairs_pos) else 0
    return scores, labels

def precision_recall_f1(y_true: np.ndarray, y_pred: np.ndarray) -> Tuple[float, float, float]:
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1

def sweep_thresholds(scores: np.ndarray, labels: np.ndarray, start: float, end: float, step: float) -> pd.DataFrame:
    if scores.size == 0:
        return pd.DataFrame(columns=["threshold", "precision", "recall", "f1", "support_pos", "support_neg"])
    thrs = np.arange(start, end + 1e-9, step, dtype=np.float32)
    rows = []
    pos_cnt = int((labels == 1).sum())
    neg_cnt = int((labels == 0).sum())
    for t in thrs:
        y_pred = (scores >= t).astype(np.int32)
        p, r, f1 = precision_recall_f1(labels, y_pred)
        rows.append((float(t), p, r, f1, pos_cnt, neg_cnt))
    return pd.DataFrame(rows, columns=["threshold", "precision", "recall", "f1", "support_pos", "support_neg"])

def pick_threshold(metrics: pd.DataFrame,
                   min_precision: Optional[float],
                   min_recall: Optional[float]) -> Dict[str, Optional[float]]:
    if metrics.empty:
        return {"best_f1": None, "best_at_min_precision": None, "best_at_min_recall": None}
    best_f1_row = metrics.iloc[metrics["f1"].values.argmax()]
    best_f1 = float(best_f1_row["threshold"])
    best_at_min_p = None
    if isinstance(min_precision, float):
        sub = metrics[metrics["precision"] >= min_precision]
        if not sub.empty:
            best_at_min_p = float(sub.iloc[sub["f1"].values.argmax()]["threshold"])
    best_at_min_r = None
    if isinstance(min_recall, float):
        sub = metrics[metrics["recall"] >= min_recall]
        if not sub.empty:
            best_at_min_r = float(sub.iloc[sub["f1"].values.argmax()]["threshold"])
    return {"best_f1": best_f1, "best_at_min_precision": best_at_min_p, "best_at_min_recall": best_at_min_r}

# ======================
# Autogen eval pairs (when missing)
# ======================
def _norm_key(s: str) -> str:
    s = s.lower().strip()
    s = re.sub(r"\s+", " ", s)
    s = re.sub(r"[^a-z0-9 ]+", "", s)
    return s

def _ratio(a: str, b: str) -> int:
    if USE_RAPIDFUZZ:
        try:
            from rapidfuzz.fuzz import ratio as fuzz_ratio  # type: ignore
            return int(fuzz_ratio(a, b))
        except Exception:
            pass
    from difflib import SequenceMatcher
    return int(SequenceMatcher(None, a, b, autojunk=False).ratio() * 100)

def autogen_eval_pairs(tokens: List[str]) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]], Dict[str, int]]:
    random.seed(AUTOGEN_RANDOM_SEED)
    # Positives from normalized-key groups
    pos: List[Tuple[str, str]] = []
    groups: Dict[str, List[str]] = {}
    for t in tokens:
        groups.setdefault(_norm_key(t), []).append(t)
    for key, arr in groups.items():
        if len(arr) < 2 or not key:
            continue
        arr = list(dict.fromkeys(arr))
        for other in arr[1:1 + AUTOGEN_MAX_PAIRS_PER_GROUP]:
            pos.append((arr[0], other))
        if len(pos) >= AUTOGEN_TARGET_POS:
            break
    # Top-up positives within small prefix buckets
    if len(pos) < AUTOGEN_TARGET_POS:
        buckets: Dict[str, List[str]] = {}
        for t in tokens:
            buckets.setdefault(t[:AUTOGEN_BUCKET_PREFIX_LEN].lower(), []).append(t)
        for arr in buckets.values():
            if len(pos) >= AUTOGEN_TARGET_POS:
                break
            if len(arr) < 2:
                continue
            sample = arr[:256]
            for i in range(min(len(sample), 32)):
                for j in range(i + 1, min(len(sample), 32)):
                    a, b = sample[i], sample[j]
                    if abs(len(a) - len(b)) > max(5, 0.4 * max(len(a), len(b))):
                        continue
                    if _ratio(a, b) >= AUTOGEN_POS_MIN_RATIO:
                        pos.append((a, b))
                        if len(pos) >= AUTOGEN_TARGET_POS:
                            break
                if len(pos) >= AUTOGEN_TARGET_POS:
                    break
    # Dedup positives
    pos = list(dict.fromkeys(tuple(sorted(p)) for p in pos))
    pos = [(a, b) for a, b in pos][:AUTOGEN_TARGET_POS]

    # Negatives: random dissimilar pairs
    neg: List[Tuple[str, str]] = []
    uniq_tokens = list(dict.fromkeys(tokens))
    n = len(uniq_tokens)
    seen = set()
    iters = 0
    while len(neg) < AUTOGEN_TARGET_NEG and iters < AUTOGEN_MAX_NEG_ITERS:
        iters += 1
        i, j = random.randrange(n), random.randrange(n)
        if i == j:
            continue
        a, b = uniq_tokens[i], uniq_tokens[j]
        key = (a, b) if a < b else (b, a)
        if key in seen:
            continue
        if a[:AUTOGEN_BUCKET_PREFIX_LEN].lower() == b[:AUTOGEN_BUCKET_PREFIX_LEN].lower():
            continue
        if _norm_key(a) == _norm_key(b):
            continue
        if _ratio(a, b) <= AUTOGEN_NEG_MAX_RATIO:
            seen.add(key)
            neg.append((a, b))
    stats = {"neg_iters": iters}
    return pos, neg, stats

# ======================
# Vectorized candidate pair generation
# ======================
def generate_pairs_fast(
    neighbors_idx: np.ndarray,
    neighbors_dist: np.ndarray,
    tokens: List[str],
    sim_threshold: float
) -> pd.DataFrame:
    n, k = neighbors_idx.shape
    if k < 2:
        return pd.DataFrame(columns=["token_a", "token_b", "cosine_sim", "rank"])
    idx_sub = neighbors_idx[:, 1:]
    sims = 1.0 - neighbors_dist[:, 1:]
    mask = sims >= sim_threshold
    counts = mask.sum(axis=1).astype(np.int64)
    total = int(counts.sum())
    if total == 0:
        return pd.DataFrame(columns=["token_a", "token_b", "cosine_sim", "rank"])
    row_ids = np.repeat(np.arange(n, dtype=np.int32), counts)
    col_ids = idx_sub[mask].astype(np.int32, copy=False)
    sim_vals = sims[mask].astype(np.float32, copy=False)
    offsets = np.empty(n + 1, dtype=np.int64)
    offsets[0] = 0
    np.cumsum(counts, out=offsets[1:])
    ranks = np.empty(total, dtype=np.int32)
    for i in range(n):  # intentional tiny loop; assigning ranks per row
        s, e = offsets[i], offsets[i + 1]
        if e > s:
            ranks[s:e] = np.arange(1, e - s + 1, dtype=np.int32)
    tok = np.asarray(tokens, dtype=object)
    return pd.DataFrame({
        "token_a": tok[row_ids],
        "token_b": tok[col_ids],
        "cosine_sim": sim_vals,
        "rank": ranks
    })

def maybe_add_string_metrics(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty:
        return df
    target = df if METRICS_SAMPLE_N is None else df.sample(n=min(METRICS_SAMPLE_N, len(df)), random_state=42).copy()
    if USE_RAPIDFUZZ:
        try:
            from rapidfuzz.fuzz import ratio as fuzz_ratio  # type: ignore
            target["seq_ratio"] = [
                round(fuzz_ratio(a, b) / 100.0, 4) for a, b in zip(target["token_a"], target["token_b"])
            ]
        except Exception:
            from difflib import SequenceMatcher
            target["seq_ratio"] = [
                round(SequenceMatcher(None, a, b, autojunk=False).ratio(), 4)
                for a, b in zip(target["token_a"], target["token_b"])
            ]
    else:
        from difflib import SequenceMatcher
        target["seq_ratio"] = [
            round(SequenceMatcher(None, a, b, autojunk=False).ratio(), 4)
            for a, b in zip(target["token_a"], target["token_b"])
        ]
    target["len_a"] = target["token_a"].str.len()
    target["len_b"] = target["token_b"].str.len()
    target["len_diff"] = (target["len_a"] - target["len_b"]).abs()
    return df.merge(
        target[["token_a", "token_b", "seq_ratio", "len_a", "len_b", "len_diff"]],
        on=["token_a", "token_b"], how="left"
    )

# ======================
# Main
# ======================
def main() -> None:
    model = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2" if USE_MULTILINGUAL else MODEL_NAME

    log_info("üìã CONFIGURATION (Embeddings + Tuning + Fast Pairs)")
    log_info(f"  Model: {model}")
    log_info(f"  Batch size: {BATCH_SIZE} | Normalize: {NORMALIZE}")
    log_info(f"  HNSW: M={HNSW_M}, efC={HNSW_EF_CONSTRUCTION}, efQ={HNSW_EF}")
    log_info(f"  Top-K neighbors: {TOP_K_NEIGHBORS}")
    log_info(f"  Tokens: {len(canonical_tokens):,}")
    log_info(f"  Tuning: {ENABLE_THRESHOLD_TUNING} | CSV: {EVAL_CSV_PATH or 'None'} | minP={EVAL_MIN_PRECISION} | minR={EVAL_MIN_RECALL}")
    log_info(f"  Autogen eval: {AUTOGEN_EVAL_IF_MISSING} (pos={AUTOGEN_TARGET_POS}, neg={AUTOGEN_TARGET_NEG})")
    log_info(f"  Extra metrics: {ADD_STRING_METRICS} | RapidFuzz: {USE_RAPIDFUZZ} | sample={METRICS_SAMPLE_N}")
    log_memory()

    # Load model
    log_stage("LOADING EMBEDDING MODEL")
    t0 = time.time()
    embedder = load_embedder(model)
    dim = embedder.get_sentence_embedding_dimension()
    log_info(f"‚úÖ Model loaded. Embedding dim: {dim}")
    log_time("Model load", t0)

    # Threshold tuning
    tuned_threshold = None
    eval_used_autogen = False
    eval_metrics_path = OUTPUTS_DIR / "similarity_threshold_metrics.csv"
    eval_summary_path = OUTPUTS_DIR / "similarity_threshold_summary.json"

    if ENABLE_THRESHOLD_TUNING:
        log_stage("THRESHOLD TUNING (QUICK EVAL)")
        pos_pairs: List[Tuple[str, str]] = []
        neg_pairs: List[Tuple[str, str]] = []

        if EVAL_CSV_PATH:
            pos_pairs, neg_pairs = load_eval_pairs_from_csv(EVAL_CSV_PATH)
        else:
            pp, nn = gather_eval_pairs_from_namespace()
            pos_pairs, neg_pairs = pp, nn

        if (not pos_pairs or not neg_pairs) and AUTOGEN_EVAL_IF_MISSING:
            log_info("‚ÑπÔ∏è No eval pairs supplied. Auto-generating a small eval set from tokens...")
            pos_pairs, neg_pairs, stats = autogen_eval_pairs([t for t in map(str, canonical_tokens) if t])
            eval_used_autogen = True
            log_info(f"‚úÖ Auto-generated eval pairs: pos={len(pos_pairs)}, neg={len(neg_pairs)} (neg iters={stats['neg_iters']})")

        if (not pos_pairs) or (not neg_pairs):
            log_info("‚ö†Ô∏è  No eval pairs found. Skipping tuning and keeping configured threshold.")
        else:
            t_eval = time.time()
            scores, labels = embed_eval_pairs(embedder, pos_pairs, neg_pairs, BATCH_SIZE, NORMALIZE)
            metrics_df = sweep_thresholds(scores, labels, EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP)
            choices = pick_threshold(metrics_df, EVAL_MIN_PRECISION, EVAL_MIN_RECALL)
            tuned_threshold = choices["best_at_min_precision"] or choices["best_at_min_recall"] or choices["best_f1"]
            metrics_df.to_csv(eval_metrics_path, index=False)
            with open(eval_summary_path, "w", encoding="utf-8") as f:
                json.dump({
                    "grid": [EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP],
                    "min_precision": EVAL_MIN_PRECISION,
                    "min_recall": EVAL_MIN_RECALL,
                    "choices": choices,
                    "picked_threshold": tuned_threshold,
                    "pos_pairs": len(pos_pairs),
                    "neg_pairs": len(neg_pairs),
                    "autogen_used": eval_used_autogen
                }, f, indent=2)
            log_time("Tuning", t_eval)
            log_info(f"‚úÖ Tuned threshold = {tuned_threshold:.3f}  (choices={choices})")

    # Embeddings for full token set
    log_stage("EMBEDDING CANONICAL TOKENS")
    t1 = time.time()
    X = encode_texts(embedder, canonical_tokens, BATCH_SIZE, NORMALIZE)  # shape: (n, dim)
    log_time("Embedding", t1)
    log_info(f"‚úÖ Embeddings: {X.shape[0]:,} √ó {X.shape[1]:,} (float32)")
    log_memory()

    # Build & query ANN
    log_stage("BUILDING ANN INDEX & RETRIEVING NEIGHBORS")
    k = min(TOP_K_NEIGHBORS + 1, X.shape[0])
    t2 = time.time()
    index = build_hnsw(X, HNSW_M, HNSW_EF_CONSTRUCTION, HNSW_EF, k)
    idx, dist = knn_query_batched(index, X, k=k, batch=QUERY_BATCH_SIZE)
    log_time("HNSW build + KNN", t2)
    log_info("‚úÖ ANN completed")
    log_memory()

    # Use tuned threshold if available
    threshold = float(tuned_threshold) if tuned_threshold is not None else float(SIMILARITY_THRESHOLD)
    log_info(f"üîé Using similarity threshold: {threshold:.3f}")

    # Generate candidate pairs (vectorized)
    log_stage("GENERATING CANDIDATE PAIRS (FAST, VECTORIZED)")
    t3 = time.time()
    pairs_df = generate_pairs_fast(neighbors_idx=idx, neighbors_dist=dist,
                                   tokens=list(canonical_tokens), sim_threshold=threshold)
    log_time("Candidate pair generation (vectorized)", t3)
    log_info(f"üìä Candidate pairs: {len(pairs_df):,}")
    log_memory()

    # Optional extra metrics
    if ADD_STRING_METRICS and not pairs_df.empty:
        log_stage("ADDING STRING METRICS (OPTIONAL)")
        t4 = time.time()
        pairs_df = maybe_add_string_metrics(pairs_df)
        log_time("Extra string metrics", t4)
        log_info("‚úÖ Metrics added")

    # Save outputs
    log_stage("SAVING RESULTS")
    pairs_path = OUTPUTS_DIR / "candidate_similarity_pairs.parquet"
    sample_path = OUTPUTS_DIR / "similarity_sample_inspection.parquet"
    meta_path = OUTPUTS_DIR / "cell3_similarity_metadata.json"

    pairs_df.to_parquet(pairs_path, index=False)
    sample_n = min(5000, len(pairs_df))
    if sample_n > 0:
        pairs_df.sample(n=sample_n, random_state=42).to_parquet(sample_path, index=False)

    meta = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "config": {
            "model": model,
            "batch_size": BATCH_SIZE,
            "normalize": NORMALIZE,
            "top_k_neighbors": TOP_K_NEIGHBORS,
            "similarity_threshold_initial": SIMILARITY_THRESHOLD,
            "similarity_threshold_used": threshold,
            "hnsw": {"M": HNSW_M, "ef_construction": HNSW_EF_CONSTRUCTION, "ef": HNSW_EF},
            "query_batch_size": QUERY_BATCH_SIZE,
            "tuning": {
                "enabled": ENABLE_THRESHOLD_TUNING,
                "csv": EVAL_CSV_PATH,
                "min_precision": EVAL_MIN_PRECISION,
                "min_recall": EVAL_MIN_RECALL,
                "grid": [EVAL_GRID_START, EVAL_GRID_END, EVAL_GRID_STEP],
                "metrics_csv": str(eval_metrics_path) if ENABLE_THRESHOLD_TUNING else None,
                "summary_json": str(eval_summary_path) if ENABLE_THRESHOLD_TUNING else None,
                "autogen_used": eval_used_autogen
            },
            "extras": {
                "add_string_metrics": ADD_STRING_METRICS,
                "use_rapidfuzz": USE_RAPIDFUZZ,
                "metrics_sample_n": METRICS_SAMPLE_N
            }
        },
        "stats": {
            "total_tokens": len(canonical_tokens),
            "embedding_dim": int(X.shape[1]),
            "candidate_pairs_found": len(pairs_df)
        },
        "outputs": {
            "candidate_pairs_file": str(pairs_path),
            "sample_file": str(sample_path) if sample_n > 0 else None
        }
    }
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(meta, f, indent=2, ensure_ascii=False)
    log_info(f"‚úÖ Saved pairs: {pairs_path}")
    if sample_n > 0:
        log_info(f"‚úÖ Saved sample: {sample_path}")
    if ENABLE_THRESHOLD_TUNING:
        if eval_used_autogen:
            log_info(f"‚úÖ Saved tuning metrics: {eval_metrics_path} (auto-generated eval set)")
            log_info(f"‚úÖ Saved tuning summary: {eval_summary_path}")
        else:
            log_info(f"‚úÖ Saved tuning metrics: {eval_metrics_path}")
            log_info(f"‚úÖ Saved tuning summary: {eval_summary_path}")
    log_info(f"‚úÖ Metadata: {meta_path}")

    # Summary
    log_stage("SUMMARY STATISTICS")
    if not pairs_df.empty:
        sims = pairs_df["cosine_sim"].to_numpy()
        log_info(f"Cosine similarity: min={float(sims.min()):.3f}, mean={float(sims.mean()):.3f}, max={float(sims.max()):.3f}")
        covered = len(set(pairs_df["token_a"]).union(pairs_df["token_b"])) / len(canonical_tokens) * 100
        log_info(f"Token coverage: {covered:.1f}%")
    log_info("üéâ Cell 3 completed with pretrained embeddings, auto-tuning, and fast pair generation!")

if __name__ == "__main__":
    main()

INFO:__main__:üìã CONFIGURATION (Embeddings + Tuning + Fast Pairs)
INFO:__main__:  Model: sentence-transformers/all-MiniLM-L6-v2
INFO:__main__:  Batch size: 1024 | Normalize: True
INFO:__main__:  HNSW: M=16, efC=200, efQ=200
INFO:__main__:  Top-K neighbors: 50
INFO:__main__:  Tokens: 254,778
INFO:__main__:  Tuning: True | CSV: None | minP=0.9 | minR=None
INFO:__main__:  Autogen eval: True (pos=1500, neg=1500)
INFO:__main__:  Extra metrics: False | RapidFuzz: True | sample=None
INFO:__main__:üíæ Memory usage: 6.18 GB
INFO:__main__:üöÄ LOADING EMBEDDING MODEL
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2


üìã CONFIGURATION (Embeddings + Tuning + Fast Pairs)
  Model: sentence-transformers/all-MiniLM-L6-v2
  Batch size: 1024 | Normalize: True
  HNSW: M=16, efC=200, efQ=200
  Top-K neighbors: 50
  Tokens: 254,778
  Tuning: True | CSV: None | minP=0.9 | minR=None
  Autogen eval: True (pos=1500, neg=1500)
  Extra metrics: False | RapidFuzz: True | sample=None
üíæ Memory usage: 6.18 GB
üöÄ LOADING EMBEDDING MODEL


INFO:__main__:‚úÖ Model loaded. Embedding dim: 384
INFO:__main__:Model load: 4.08s
INFO:__main__:üöÄ THRESHOLD TUNING (QUICK EVAL)
INFO:__main__:‚ÑπÔ∏è No eval pairs supplied. Auto-generating a small eval set from tokens...


‚úÖ Model loaded. Embedding dim: 384
‚è±Ô∏è Model load took 4.08 s
üöÄ THRESHOLD TUNING (QUICK EVAL)
‚ÑπÔ∏è No eval pairs supplied. Auto-generating a small eval set from tokens...


INFO:__main__:‚úÖ Auto-generated eval pairs: pos=633, neg=1500 (neg iters=7070)


‚úÖ Auto-generated eval pairs: pos=633, neg=1500 (neg iters=7070)


Batches:   0%|          | 0/4 [00:00<?, ?it/s]

INFO:__main__:Tuning: 0.71s
INFO:__main__:‚úÖ Tuned threshold = 0.450  (choices={'best_f1': 0.44999992847442627, 'best_at_min_precision': 0.44999992847442627, 'best_at_min_recall': None})
INFO:__main__:üöÄ EMBEDDING CANONICAL TOKENS


‚è±Ô∏è Tuning took 0.71 s
‚úÖ Tuned threshold = 0.450  (choices={'best_f1': 0.44999992847442627, 'best_at_min_precision': 0.44999992847442627, 'best_at_min_recall': None})
üöÄ EMBEDDING CANONICAL TOKENS


Batches:   0%|          | 0/249 [00:00<?, ?it/s]

INFO:__main__:Embedding: 80.62s
INFO:__main__:‚úÖ Embeddings: 254,778 √ó 384 (float32)
INFO:__main__:üíæ Memory usage: 6.60 GB
INFO:__main__:üöÄ BUILDING ANN INDEX & RETRIEVING NEIGHBORS


‚è±Ô∏è Embedding took 80.62 s
‚úÖ Embeddings: 254,778 √ó 384 (float32)
üíæ Memory usage: 6.60 GB
üöÄ BUILDING ANN INDEX & RETRIEVING NEIGHBORS


INFO:__main__:HNSW build + KNN: 136.21s
INFO:__main__:‚úÖ ANN completed
INFO:__main__:üíæ Memory usage: 7.06 GB
INFO:__main__:üîé Using similarity threshold: 0.450
INFO:__main__:üöÄ GENERATING CANDIDATE PAIRS (FAST, VECTORIZED)


‚è±Ô∏è HNSW build + KNN took 136.21 s
‚úÖ ANN completed
üíæ Memory usage: 7.06 GB
üîé Using similarity threshold: 0.450
üöÄ GENERATING CANDIDATE PAIRS (FAST, VECTORIZED)


INFO:__main__:Candidate pair generation (vectorized): 3.97s
INFO:__main__:üìä Candidate pairs: 12,502,616
INFO:__main__:üíæ Memory usage: 7.34 GB
INFO:__main__:üöÄ SAVING RESULTS


‚è±Ô∏è Candidate pair generation (vectorized) took 3.97 s
üìä Candidate pairs: 12,502,616
üíæ Memory usage: 7.34 GB
üöÄ SAVING RESULTS


INFO:__main__:‚úÖ Saved pairs: romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.parquet
INFO:__main__:‚úÖ Saved sample: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_sample_inspection.parquet
INFO:__main__:‚úÖ Saved tuning metrics: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_metrics.csv (auto-generated eval set)
INFO:__main__:‚úÖ Saved tuning summary: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_summary.json
INFO:__main__:‚úÖ Metadata: romance-novel-nlp-research/src/eda_analysis/outputs/cell3_similarity_metadata.json
INFO:__main__:üöÄ SUMMARY STATISTICS
INFO:__main__:Cosine similarity: min=0.450, mean=0.678, max=1.000
INFO:__main__:Token coverage: 100.0%


‚úÖ Saved pairs: romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.parquet
‚úÖ Saved sample: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_sample_inspection.parquet
‚úÖ Saved tuning metrics: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_metrics.csv (auto-generated eval set)
‚úÖ Saved tuning summary: romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_summary.json
‚úÖ Metadata: romance-novel-nlp-research/src/eda_analysis/outputs/cell3_similarity_metadata.json
üöÄ SUMMARY STATISTICS
Cosine similarity: min=0.450, mean=0.678, max=1.000
Token coverage: 100.0%


INFO:__main__:üéâ Cell 3 completed with pretrained embeddings, auto-tuning, and fast pair generation!


üéâ Cell 3 completed with pretrained embeddings, auto-tuning, and fast pair generation!


In [10]:
# romance-novel-nlp-research/src/eda_analysis/cell3a_explore_outputs.py
# Heavy-print exploratory audit of Cell 3 outputs. Safe on large parquet via DuckDB/Arrow.

from __future__ import annotations
import os, json, math, time, shutil, sys
from pathlib import Path
from typing import Optional

import pandas as pd
import numpy as np

# ---------- Paths ----------
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
PAIRS = OUT / "candidate_similarity_pairs.parquet"
SAMPLE = OUT / "similarity_sample_inspection.parquet"
METRICS_CSV = OUT / "similarity_threshold_metrics.csv"
SUMMARY_JSON = OUT / "similarity_threshold_summary.json"
META_JSON = OUT / "cell3_similarity_metadata.json"

# ---------- Utils ----------
def human(n: int) -> str:
    units = ["B","KB","MB","GB","TB"]
    i = 0
    x = float(n)
    while x >= 1024 and i < len(units)-1:
        x /= 1024; i += 1
    return f"{x:.2f} {units[i]}"

def p(s: str) -> None:
    print(s); sys.stdout.flush()

def print_header(title: str) -> None:
    sep = "=" * 80
    p(sep); p(f"üîé {title}"); p(sep)

def file_info(path: Path) -> str:
    return f"{path} | exists={path.exists()} | size={human(path.stat().st_size) if path.exists() else '‚Äî'}"

def try_import(mod: str):
    try:
        return __import__(mod)
    except Exception:
        return None

duckdb = try_import("duckdb")
pyarrow = try_import("pyarrow")
pa_ds = None
if pyarrow:
    try:
        import pyarrow.dataset as pa_ds
    except Exception:
        pa_ds = None

try:
    from rapidfuzz.fuzz import ratio as fuzz_ratio
    HAS_RAPIDFUZZ = True
except Exception:
    HAS_RAPIDFUZZ = False

# ---------- 0) Print file inventory ----------
print_header("OUTPUT FILES INVENTORY")
p(file_info(PAIRS))
p(file_info(SAMPLE))
p(file_info(METRICS_CSV))
p(file_info(SUMMARY_JSON))
p(file_info(META_JSON))

# ---------- 1) Load and print metadata + tuning summary ----------
print_header("METADATA + TUNING SUMMARY (QUICK VIEW)")
meta = {}
if META_JSON.exists():
    meta = json.loads(META_JSON.read_text())
    p(json.dumps(meta.get("config", {}), indent=2))
    p(json.dumps(meta.get("stats", {}), indent=2))
else:
    p("No metadata JSON found.")

if SUMMARY_JSON.exists():
    summ = json.loads(SUMMARY_JSON.read_text())
    p("Tuning summary:")
    p(json.dumps(summ, indent=2))
else:
    p("No tuning summary JSON found.")

if METRICS_CSV.exists():
    p("\nMetrics CSV (first 10 rows):")
    try:
        mdf = pd.read_csv(METRICS_CSV)
        p(mdf.head(10).to_string(index=False))
        # Print best rows
        best_f1_row = mdf.iloc[mdf["f1"].values.argmax()]
        p(f"\nBest F1 threshold: {best_f1_row['threshold']:.3f} | P={best_f1_row['precision']:.3f} R={best_f1_row['recall']:.3f}")
        if "precision" in mdf.columns:
            hi_p = mdf[mdf["precision"] >= 0.90]
            if not hi_p.empty:
                r = hi_p.iloc[hi_p["f1"].values.argmax()]
                p(f"Best @P>=0.90: thr={r['threshold']:.3f} | P={r['precision']:.3f} R={r['recall']:.3f} F1={r['f1']:.3f}")
    except Exception as e:
        p(f"‚ö†Ô∏è Failed reading metrics CSV: {e}")
else:
    p("No metrics CSV found.")

# ---------- 2) Peek at SAMPLE parquet ----------
print_header("SAMPLE PARQUET QUICK PEEK")
if SAMPLE.exists():
    sdf = pd.read_parquet(SAMPLE)
    p(f"Sample rows: {len(sdf):,}")
    p("\nHead(5):"); p(sdf.head(5).to_string(index=False))
    p("\nTail(5):"); p(sdf.tail(5).to_string(index=False))
    p("\nRandom(5):"); p(sdf.sample(min(5, len(sdf)), random_state=42).to_string(index=False))
    # Basic stats on sample
    if "cosine_sim" in sdf.columns:
        p("\nSample cosine_sim stats:")
        p(sdf["cosine_sim"].describe(percentiles=[.25,.5,.75,.9,.95,.99]).to_string())
    # Short tokens diagnostics on sample
    sdf["_len_a"] = sdf["token_a"].str.len()
    sdf["_len_b"] = sdf["token_b"].str.len()
    short_cut = sdf[(sdf["_len_a"] <= 3) | (sdf["_len_b"] <= 3)]
    p(f"\nSample rows with short tokens (<=3 chars): {len(short_cut):,}")
    if not short_cut.empty:
        p(short_cut.nlargest(10, "cosine_sim")[["token_a","token_b","cosine_sim","rank","_len_a","_len_b"]].to_string(index=False))
    # RapidFuzz spot-check (sample) to detect semantic vs surface mismatch
    if HAS_RAPIDFUZZ:
        tmp = sdf.sample(min(1000, len(sdf)), random_state=123).copy()
        tmp["rf_ratio"] = [fuzz_ratio(a,b)/100.0 for a,b in zip(tmp["token_a"], tmp["token_b"])]
        low_char_high_sem = tmp[(tmp["rf_ratio"] < 0.4) & (tmp["cosine_sim"] >= 0.8)]
        p(f"\nLow RapidFuzz (<0.4) but high cosine (>=0.8) in 1k sample: {len(low_char_high_sem):,}")
        if not low_char_high_sem.empty:
            p(low_char_high_sem.nlargest(20, "cosine_sim")[["token_a","token_b","cosine_sim","rf_ratio"]].to_string(index=False))
else:
    p("No sample parquet found.")

# ---------- 3) FULL PAIRS exploration (DuckDB preferred) ----------
print_header("FULL PAIRS EXPLORATION (DuckDB/Arrow)")

if duckdb and PAIRS.exists():
    p("Using DuckDB for SQL on parquet (no full load).")
    con = duckdb.connect()
    con.execute("PRAGMA threads=%d" % max(1, os.cpu_count() or 4))
    con.execute("SET memory_limit='80%';")
    con.execute("INSTALL json; LOAD json;")  # harmless if already loaded

    # Global stats
    p("\n[GLOBAL]")
    q = f"""
        SELECT
          COUNT(*) AS rows,
          MIN(cosine_sim) AS min_sim,
          AVG(cosine_sim) AS mean_sim,
          MAX(cosine_sim) AS max_sim
        FROM read_parquet('{PAIRS.as_posix()}')
    """
    p(con.execute(q).df().to_string(index=False))

    # Histogram bins
    p("\n[HISTOGRAM 0.45..1.0, bin=0.05]")
    q = f"""
        WITH binned AS (
          SELECT
            CAST(FLOOR((cosine_sim - 0.45) / 0.05) AS INTEGER) AS bin_id
          FROM read_parquet('{PAIRS.as_posix()}')
          WHERE cosine_sim >= 0.45
        )
        SELECT bin_id,
               0.45 + bin_id*0.05 AS bin_start,
               0.45 + (bin_id+1)*0.05 AS bin_end,
               COUNT(*) AS cnt
        FROM binned
        GROUP BY 1
        ORDER BY 1
    """
    p(con.execute(q).df().to_string(index=False))

    # Top hubs (many neighbors per token_a)
    p("\n[TOP HUBS by degree (token_a count)]")
    q = f"""
        SELECT token_a, COUNT(*) AS deg
        FROM read_parquet('{PAIRS.as_posix()}')
        GROUP BY token_a
        ORDER BY deg DESC
        LIMIT 30
    """
    p(con.execute(q).df().to_string(index=False))

    # Very short tokens involvement
    p("\n[SHORT TOKENS involvement (<=3 chars)]")
    q = f"""
        SELECT COUNT(*) AS rows_short
        FROM read_parquet('{PAIRS.as_posix()}')
        WHERE length(token_a) <= 3 OR length(token_b) <= 3
    """
    p(con.execute(q).df().to_string(index=False))

    # Extremes for manual inspection
    p("\n[TOP 20 pairs by similarity]")
    q = f"""
        SELECT token_a, token_b, cosine_sim, rank
        FROM read_parquet('{PAIRS.as_posix()}')
        ORDER BY cosine_sim DESC
        LIMIT 20
    """
    p(con.execute(q).df().to_string(index=False))

    p("\n[BORDERLINE near threshold 0.45..0.47 (20 rows)]")
    q = f"""
        SELECT token_a, token_b, cosine_sim, rank
        FROM read_parquet('{PAIRS.as_posix()}')
        WHERE cosine_sim BETWEEN 0.45 AND 0.47
        ORDER BY cosine_sim ASC
        LIMIT 20
    """
    p(con.execute(q).df().to_string(index=False))

    # Optional: cap per-token neighbors gauge
    p("\n[DEGREE QUANTILES for token_a]")
    q = f"""
        SELECT
          MIN(deg) AS min_deg,
          AVG(deg) AS mean_deg,
          MAX(deg) AS max_deg,
          QUANTILE_CONT(deg, 0.50) AS p50,
          QUANTILE_CONT(deg, 0.90) AS p90,
          QUANTILE_CONT(deg, 0.99) AS p99
        FROM (
          SELECT token_a, COUNT(*) AS deg
          FROM read_parquet('{PAIRS.as_posix()}')
          GROUP BY token_a
        )
    """
    p(con.execute(q).df().to_string(index=False))

    con.close()

elif pa_ds and PAIRS.exists():
    p("DuckDB not available. Using PyArrow dataset scan (chunked).")
    import pyarrow.dataset as ds
    dsobj = ds.dataset(PAIRS.as_posix(), format="parquet")
    scanner = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=250_000)

    n_rows = 0
    min_sim, max_sim, sum_sim = 1.0, 0.0, 0.0
    bins = np.zeros(11, dtype=np.int64)  # 0.45..1.0 step 0.05 + overflow/underflow bucket
    short_rows = 0
    t0 = time.time()
    for batch in scanner.to_batches():
        df = batch.to_pandas(types_mapper=None)
        n = len(df); n_rows += n
        cs = df["cosine_sim"].to_numpy(np.float32, copy=False)
        min_sim = min(min_sim, float(cs.min()))
        max_sim = max(max_sim, float(cs.max()))
        sum_sim += float(cs.sum())
        # bins
        idx = ((cs - 0.45) / 0.05).astype(np.int32)
        for b in idx[(idx >= 0) & (idx < 11)]:
            bins[int(b)] += 1
        # short tokens
        short_rows += int(((df["token_a"].str.len() <= 3) | (df["token_b"].str.len() <= 3)).sum())
    mean_sim = sum_sim / max(1, n_rows)
    p(f"[GLOBAL] rows={n_rows:,} min={min_sim:.3f} mean={mean_sim:.3f} max={max_sim:.3f}")
    p("[HISTOGRAM 0.45..1.0 step 0.05]")
    for i, c in enumerate(bins):
        p(f"{0.45 + i*0.05:.2f}‚Äì{0.45 + (i+1)*0.05:.2f}: {int(c):,}")
    p(f"[SHORT TOKENS] rows_short = {short_rows:,} (of {n_rows:,})")
    p(f"Scan time: {time.time()-t0:.2f}s")

else:
    p("‚ö†Ô∏è Neither DuckDB nor PyArrow available; using SAMPLE only. Install duckdb or pyarrow for full-file stats: `pip install duckdb pyarrow`.")

# ---------- 4) Action flags (what to tweak next) ----------
print_header("ACTION FLAGS (NEXT FEATURE-ENGINEERING STEPS)")
# Heuristics based on observed stats from metadata/prints
thr_used = None
try:
    thr_used = float(meta.get("config", {}).get("similarity_threshold_used", None))
except Exception:
    thr_used = None

# Print deterministic guidance; adjust as needed after reading console outputs.
p(f"- Threshold in use: {thr_used if thr_used is not None else 'unknown'}")
p("- Inspect histogram & borderline pairs (0.45‚Äì0.50). If many questionable, raise threshold (e.g., +0.05).")
p("- Check TOP HUBS. If heavy tails (p99 deg >> p50), cap neighbors per token (e.g., keep top-10 by sim per token).")
p("- If short tokens appear frequently, add a short-token guard: require len>=4 or higher threshold for len<=3.")
p("- Consider symmetry filter: keep A‚ÜîB only if both sides rank <= R (e.g., R=10) to drop one-sided noise.")
p("- Optionally add RapidFuzz gate on short strings: rf_ratio>=0.6 when min(len_a,len_b)<=4.")
p("- Next build clusters (connected components on pairs) to produce canonical groups & representatives.")

print_header("DONE ‚Äî Review above prints to pick concrete thresholds/filters.")

üîé OUTPUT FILES INVENTORY
romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.parquet | exists=True | size=176.77 MB
romance-novel-nlp-research/src/eda_analysis/outputs/similarity_sample_inspection.parquet | exists=True | size=181.64 KB
romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_metrics.csv | exists=True | size=6.67 KB
romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_summary.json | exists=True | size=339.00 B
romance-novel-nlp-research/src/eda_analysis/outputs/cell3_similarity_metadata.json | exists=True | size=1.27 KB
üîé METADATA + TUNING SUMMARY (QUICK VIEW)
{
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "batch_size": 1024,
  "normalize": true,
  "top_k_neighbors": 50,
  "similarity_threshold_initial": 0.3,
  "similarity_threshold_used": 0.44999992847442627,
  "hnsw": {
    "M": 16,
    "ef_construction": 200,
    "ef": 200
  },
  "query_batch_size": 25000,
  "tuning": {
    "enabled": true

In [11]:
# romance-novel-nlp-research/src/eda_analysis/cell3b_filter_and_audit.py
"""
Why: Decide next feature-engineering steps with concrete evidence.
This streams the full pairs parquet, applies proposed filters, prints exhaustive stats, and writes filtered outputs.
"""

from __future__ import annotations
import os, sys, json, time, heapq
from pathlib import Path
from collections import defaultdict
from typing import Dict, List, Tuple, Optional

import numpy as np
import pandas as pd

# ---------- Config (tweak here) ----------
BASE_THRESHOLD: float = 0.50      # raise from 0.45 ‚Üí trims ~646k rows
SHORT_LEN_MAX: int = 3
SHORT_MIN_SIM: float = 0.80
SHORT_MIN_RF: float = 0.60        # requires rapidfuzz; otherwise skip this gate
SYMMETRY_R: int = 10              # keep only if A->B rank<=R AND B->A rank<=R
DEGREE_CAP: int = 25              # top-K per token_a after filters

SAMPLE_SAVE_MAX: int = 5000
PRINT_TOP_HUBS: int = 30
HIST_BIN_START: float = 0.45
HIST_BIN_STEP: float = 0.05
HIST_BIN_END: float = 1.00

OUT_DIR = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
IN_PAIRS = OUT_DIR / "candidate_similarity_pairs.parquet"
IN_META  = OUT_DIR / "cell3_similarity_metadata.json"

OUT_PAIRS_FILTERED = OUT_DIR / "candidate_similarity_pairs.filtered.parquet"
OUT_SAMPLE_FILTERED = OUT_DIR / "similarity_sample_inspection.filtered.parquet"
OUT_FILTER_META = OUT_DIR / "cell3_filters_metadata.json"

# ---------- Helpers ----------
def p(x: str) -> None:
    print(x); sys.stdout.flush()

def head(title: str) -> None:
    sep = "=" * 90
    p(sep); p(f"üîé {title}"); p(sep)

def human(n: int) -> str:
    units = ["B","KB","MB","GB","TB"]; i=0; x=float(n)
    while x>=1024 and i<len(units)-1: x/=1024; i+=1
    return f"{x:.2f} {units[i]}"

def file_info(path: Path) -> str:
    return f"{path} | exists={path.exists()} | size={human(path.stat().st_size) if path.exists() else '‚Äî'}"

# ---------- Imports (optional) ----------
try:
    import pyarrow.dataset as ds
    import pyarrow as pa
except Exception as e:
    raise RuntimeError("Requires pyarrow. Install: pip install pyarrow") from e

try:
    from rapidfuzz.fuzz import ratio as fuzz_ratio
    HAS_RAPIDFUZZ = True
except Exception:
    HAS_RAPIDFUZZ = False

# ---------- 0) Inventory ----------
head("INPUTS")
p(file_info(IN_PAIRS))
p(file_info(IN_META))
if not IN_PAIRS.exists():
    raise FileNotFoundError("pairs parquet not found.")
meta = json.loads(IN_META.read_text()) if IN_META.exists() else {}
p("\nMeta (config.stats excerpt):")
p(json.dumps({"config": meta.get("config", {}), "stats": meta.get("stats", {})}, indent=2))

# ---------- 1) Before-stats (global + histogram + short rate, streaming) ----------
head("BEFORE STATS (STREAMING)")
dsobj = ds.dataset(IN_PAIRS.as_posix(), format="parquet")
scanner = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=250_000)

total_rows = 0
short_rows = 0
min_sim = 1.0
max_sim = 0.0
sum_sim = 0.0

# 0.45..1.00 bins
nbins = int(np.ceil((HIST_BIN_END - HIST_BIN_START) / HIST_BIN_STEP))
bins = np.zeros(nbins, dtype=np.int64)

t0 = time.time()
for batch in scanner.to_batches():
    df = batch.to_pandas()
    n = len(df); total_rows += n
    cs = df["cosine_sim"].to_numpy(np.float32, copy=False)
    min_sim = min(min_sim, float(cs.min()))
    max_sim = max(max_sim, float(cs.max()))
    sum_sim += float(cs.sum())
    idx = ((cs - HIST_BIN_START) / HIST_BIN_STEP).astype(np.int32)
    m = (idx >= 0) & (idx < nbins)
    if m.any():
        bincount = np.bincount(idx[m], minlength=nbins)
        bins[:len(bincount)] += bincount
    short_rows += int(((df["token_a"].str.len() <= SHORT_LEN_MAX) | (df["token_b"].str.len() <= SHORT_LEN_MAX)).sum())

mean_sim = sum_sim / max(1, total_rows)
p(f"Rows={total_rows:,} | min={min_sim:.3f} mean={mean_sim:.3f} max={max_sim:.3f}")
p("Histogram (pre-filter):")
for i in range(nbins):
    start = HIST_BIN_START + i*HIST_BIN_STEP
    end = start + HIST_BIN_STEP
    p(f"{start:.2f}‚Äì{end:.2f}: {int(bins[i]):,}")
p(f"Short-token rows (<= {SHORT_LEN_MAX} chars): {short_rows:,} ({short_rows/total_rows*100:.2f}%)")
p(f"Scan time: {time.time()-t0:.2f}s")

# ---------- 2) Pass-0: reverse_rank for symmetry (only ranks ‚â§ R) ----------
head(f"BUILD REVERSE RANK MAP (rank‚â§{SYMMETRY_R})")
rev_map: Dict[Tuple[str,str], int] = {}
t1 = time.time()
scanner_r = dsobj.scanner(columns=["token_a","token_b","rank"], batch_size=250_000)
kept_rev = 0
for batch in scanner_r.to_batches():
    df = batch.to_pandas()
    small = df[df["rank"] <= SYMMETRY_R]
    if small.empty:
        continue
    for a, b, r in zip(small["token_a"], small["token_b"], small["rank"]):
        rev_map[(a, b)] = int(r)
    kept_rev += len(small)
p(f"Reverse entries stored: {kept_rev:,} | time: {time.time()-t1:.2f}s")

# ---------- 3) Pass-1: apply filters + per-token top-K ----------
head("APPLY FILTERS (streaming) + PER-TOKEN TOP-K")
drop_stats = defaultdict(int)
heaps: Dict[str, List[Tuple[float,int,str,float]]] = defaultdict(list)  # token_a -> min-heap of (sim, rank, token_b, sim) ; sim duplicated for clarity

def maybe_push(a: str, b: str, sim: float, rank: int) -> None:
    h = heaps[a]
    item = (sim, rank, b, sim)
    if len(h) < DEGREE_CAP:
        heapq.heappush(h, item)
    else:
        if sim > h[0][0]:  # replace worst
            heapq.heapreplace(h, item)

def short_pair_gate(a: str, b: str, sim: float) -> bool:
    la = len(a); lb = len(b)
    if min(la, lb) > SHORT_LEN_MAX:
        return True
    if sim < SHORT_MIN_SIM:
        return False
    if HAS_RAPIDFUZZ:
        return (fuzz_ratio(a, b) / 100.0) >= SHORT_MIN_RF
    return True  # no RapidFuzz ‚Üí skip ratio gate

def sym_gate(a: str, b: str, rank_ab: int) -> bool:
    if rank_ab > SYMMETRY_R:
        return False
    rb = rev_map.get((b, a), None)
    return (rb is not None) and (rb <= SYMMETRY_R)

t2 = time.time()
scanner_f = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=200_000)
rows_seen = 0
for batch in scanner_f.to_batches():
    df = batch.to_pandas()
    rows_seen += len(df)
    for a, b, sim, r in zip(df["token_a"], df["token_b"], df["cosine_sim"], df["rank"]):
        # base threshold
        if sim < BASE_THRESHOLD:
            drop_stats["below_base_threshold"] += 1
            continue
        # short gate
        if not short_pair_gate(a, b, float(sim)):
            if min(len(a), len(b)) <= SHORT_LEN_MAX and sim < SHORT_MIN_SIM:
                drop_stats["short_below_short_min_sim"] += 1
            else:
                drop_stats["short_below_rapidfuzz"] += 1
            continue
        # symmetry
        if not sym_gate(a, b, int(r)):
            drop_stats["failed_symmetry"] += 1
            continue
        maybe_push(a, b, float(sim), int(r))
p(f"Streamed rows: {rows_seen:,} | time: {time.time()-t2:.2f}s")

# ---------- 4) Materialize filtered edges ----------
head("MATERIALIZE FILTERED EDGES")
rows = []
for a, h in heaps.items():
    for sim, r, b, sim_copy in sorted(h, key=lambda x: (-x[0], x[1])):  # highest sim first, then rank
        rows.append((a, b, float(sim), int(r)))
filtered_df = pd.DataFrame(rows, columns=["token_a","token_b","cosine_sim","rank"])
filtered_df.sort_values(["token_a","rank","cosine_sim"], ascending=[True, True, False], inplace=True)
p(f"Filtered pairs: {len(filtered_df):,}  | tokens covered: {filtered_df['token_a'].nunique():,}")

# ---------- 5) After-stats ----------
head("AFTER STATS")
def hist_of(series: pd.Series, start: float, end: float, step: float) -> List[Tuple[float,float,int]]:
    nb = int(np.ceil((end-start)/step)); out=[]
    arr = series.to_numpy(np.float32, copy=False)
    idx = ((arr - start) / step).astype(np.int32)
    counts = np.zeros(nb, dtype=np.int64)
    m = (idx >= 0) & (idx < nb)
    if m.any():
        bincount = np.bincount(idx[m], minlength=nb); counts[:len(bincount)] += bincount
    for i in range(nb):
        s = start + i*step; e = s + step
        out.append((s, e, int(counts[i])))
    return out

p(f"Rows={len(filtered_df):,} | min={filtered_df['cosine_sim'].min():.3f} mean={filtered_df['cosine_sim'].mean():.3f} max={filtered_df['cosine_sim'].max():.3f}")
p("Histogram (post-filter):")
for s,e,c in hist_of(filtered_df["cosine_sim"], HIST_BIN_START, HIST_BIN_END, HIST_BIN_STEP):
    p(f"{s:.2f}‚Äì{e:.2f}: {c:,}")

# hubs
deg = filtered_df.groupby("token_a", observed=True).size().sort_values(ascending=False)
p("\nTop hubs (post-filter):")
p(deg.head(PRINT_TOP_HUBS).to_string())
p("\nDegree quantiles (post-filter):")
q = deg.quantile([0.5,0.9,0.99]).to_numpy()
p(f"p50={q[0]:.1f}  p90={q[1]:.1f}  p99={q[2]:.1f}  max={deg.max()}")

# short-token share
short_mask = (filtered_df["token_a"].str.len() <= SHORT_LEN_MAX) | (filtered_df["token_b"].str.len() <= SHORT_LEN_MAX)
p(f"\nShort-token rows (post-filter): {int(short_mask.sum()):,} ({short_mask.mean()*100:.2f}%)")

# borderline examples
border = filtered_df[(filtered_df["cosine_sim"] >= BASE_THRESHOLD) & (filtered_df["cosine_sim"] < BASE_THRESHOLD + 0.02)]
p(f"\nBorderline examples [{BASE_THRESHOLD:.2f}‚Äì{BASE_THRESHOLD+0.02:.2f}) (up to 20):")
p(border.head(20).to_string(index=False))

# ---------- 6) Drop breakdown ----------
head("DROP BREAKDOWN")
total_before = total_rows
kept_after = len(filtered_df)
dropped_total = total_before - kept_after
p(f"Before={total_before:,}  After={kept_after:,}  Dropped={dropped_total:,} ({dropped_total/max(1,total_before)*100:.2f}%)")
p("By reason (approx, non-exclusive across stages except base filter):")
for k in sorted(drop_stats.keys()):
    p(f"- {k}: {drop_stats[k]:,}")

# ---------- 7) Save outputs ----------
head("SAVE FILTERED OUTPUTS")
filtered_df.to_parquet(OUT_PAIRS_FILTERED, index=False)
samp_n = min(SAMPLE_SAVE_MAX, len(filtered_df))
if samp_n > 0:
    filtered_df.sample(n=samp_n, random_state=42).to_parquet(OUT_SAMPLE_FILTERED, index=False)

filters_meta = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "input_pairs_file": str(IN_PAIRS),
    "filters": {
        "base_threshold": BASE_THRESHOLD,
        "short_len_max": SHORT_LEN_MAX,
        "short_min_sim": SHORT_MIN_SIM,
        "short_min_rapidfuzz": SHORT_MIN_RF if HAS_RAPIDFUZZ else None,
        "symmetry_rank_max": SYMMETRY_R,
        "degree_cap": DEGREE_CAP,
        "rapidfuzz_available": HAS_RAPIDFUZZ
    },
    "before": {
        "rows": int(total_rows),
        "hist_bins": {f"{HIST_BIN_START+i*HIST_BIN_STEP:.2f}-{HIST_BIN_START+(i+1)*HIST_BIN_STEP:.2f}": int(bins[i]) for i in range(nbins)},
        "short_rows": int(short_rows)
    },
    "after": {
        "rows": int(kept_after),
        "short_rows": int(short_mask.sum()),
        "degree_quantiles": {"p50": float(q[0]), "p90": float(q[1]), "p99": float(q[2]), "max": int(deg.max())}
    },
}
with open(OUT_FILTER_META, "w", encoding="utf-8") as f:
    json.dump(filters_meta, f, indent=2, ensure_ascii=False)

p(f"‚úÖ Saved filtered pairs ‚Üí {OUT_PAIRS_FILTERED}")
if samp_n > 0:
    p(f"‚úÖ Saved filtered sample ({samp_n}) ‚Üí {OUT_SAMPLE_FILTERED}")
p(f"‚úÖ Saved filter metadata ‚Üí {OUT_FILTER_META}")

# ---------- 8) Clear next-step guidance ----------
head("NEXT STEP SUGGESTION")
p("- If hubs remain spiky (p99>>p50), consider lowering DEGREE_CAP or raising SYMMETRY_R stringency.")
p("- If borderline noise persists, move BASE_THRESHOLD to 0.55 and re-run this cell.")
p("- Then cluster filtered graph (union-find) to derive canonical groups.")

üîé INPUTS
romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.parquet | exists=True | size=176.77 MB
romance-novel-nlp-research/src/eda_analysis/outputs/cell3_similarity_metadata.json | exists=True | size=1.27 KB

Meta (config.stats excerpt):
{
  "config": {
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "batch_size": 1024,
    "normalize": true,
    "top_k_neighbors": 50,
    "similarity_threshold_initial": 0.3,
    "similarity_threshold_used": 0.44999992847442627,
    "hnsw": {
      "M": 16,
      "ef_construction": 200,
      "ef": 200
    },
    "query_batch_size": 25000,
    "tuning": {
      "enabled": true,
      "csv": null,
      "min_precision": 0.9,
      "min_recall": null,
      "grid": [
        0.05,
        0.95,
        0.01
      ],
      "metrics_csv": "romance-novel-nlp-research/src/eda_analysis/outputs/similarity_threshold_metrics.csv",
      "summary_json": "romance-novel-nlp-research/src/eda_analysis/outputs/similarity_th

In [12]:
# romance-novel-nlp-research/src/eda_analysis/cell3c_cluster_and_inspect.py

from __future__ import annotations
import sys, os, time, json, math, re
from pathlib import Path
from typing import Dict, List, Tuple
from collections import defaultdict, Counter

import numpy as np
import pandas as pd

# --- Paths ---
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
PAIRS = OUT / "candidate_similarity_pairs.filtered.parquet"
META_JSON = OUT / "cell3_similarity_metadata.json"            # from Cell 3
FILTER_META_JSON = OUT / "cell3_filters_metadata.json"        # from Cell 3b
CLUSTERS_MAP = OUT / "clusters_token_map.parquet"
CLUSTERS_SUM = OUT / "clusters_summary.parquet"
CLUSTERS_EDGES_SAMPLE = OUT / "clusters_edges_samples.parquet"

# --- Printing helpers ---
def p(x: str) -> None:
    print(x); sys.stdout.flush()

def head(title: str) -> None:
    sep = "=" * 100
    p(sep); p(f"üîé {title}"); p(sep)

def human(n: int) -> str:
    units = ["B","KB","MB","GB","TB"]; i=0; x=float(n)
    while x>=1024 and i<len(units)-1: x/=1024; i+=1
    return f"{x:.2f} {units[i]}"

# --- Require PyArrow Dataset for streaming ---
try:
    import pyarrow as pa
    import pyarrow.dataset as ds
except Exception as e:
    raise RuntimeError("Requires pyarrow. Install: pip install pyarrow") from e

# --- Tiny utils ---
DIGIT_RE = re.compile(r"\d")
def flag_short(s: str) -> bool: return len(s) <= 3
def flag_digit(s: str) -> bool: return bool(DIGIT_RE.search(s))
def flag_zz(s: str) -> bool: return s.strip().lower().startswith("zz")

# --- Disjoint Set (Union-Find) with dynamic add (why: 230k+ nodes) ---
class DSU:
    def __init__(self):
        self.parent: Dict[int,int] = {}
        self.size: Dict[int,int] = {}

    def _add(self, x: int):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1

    def find(self, x: int) -> int:
        self._add(x)
        # Path compression
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int):
        ra, rb = self.find(a), self.find(b)
        if ra == rb: return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# --- 0) Inventory ---
head("INPUTS")
p(f"{PAIRS} | exists={PAIRS.exists()} | size={human(PAIRS.stat().st_size) if PAIRS.exists() else '‚Äî'}")
p(f"{META_JSON} | exists={META_JSON.exists()}")
p(f"{FILTER_META_JSON} | exists={FILTER_META_JSON.exists()}")

meta = json.loads(META_JSON.read_text()) if META_JSON.exists() else {}
filter_meta = json.loads(FILTER_META_JSON.read_text()) if FILTER_META_JSON.exists() else {}
total_tokens = meta.get("stats", {}).get("total_tokens", None)

p("\nMeta excerpt:")
p(json.dumps({"stats": meta.get("stats", {}), "config": meta.get("config", {})}, indent=2))
p("\nFilter meta excerpt:")
p(json.dumps(filter_meta.get("filters", {}), indent=2) if filter_meta else "‚Äî")

if not PAIRS.exists():
    raise FileNotFoundError("Filtered pairs parquet not found. Run cell3b_filter_and_audit.py first.")

# --- 1) STREAM EDGES ‚Üí BUILD DSU + DEG FEATURES ---
head("BUILD GRAPH (UNION-FIND) + DEGREE FEATURES (STREAMING)")

dsobj = ds.dataset(PAIRS.as_posix(), format="parquet")
scanner = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=250_000)

tok2id: Dict[str,int] = {}
id2tok: List[str] = []
def get_id(s: str) -> int:
    i = tok2id.get(s)
    if i is None:
        i = len(id2tok)
        tok2id[s] = i
        id2tok.append(s)
    return i

dsu = DSU()
deg = Counter()       # degree count
wdeg = Counter()      # weighted degree (sum cos)
maxsim = Counter()    # max neighbor sim

n_edges = 0
t0 = time.time()
for batch in scanner.to_batches():
    df = batch.to_pandas()
    for a, b, sim in zip(df["token_a"], df["token_b"], df["cosine_sim"]):
        ia, ib = get_id(a), get_id(b)
        dsu.union(ia, ib)
        deg[ia] += 1; deg[ib] += 1
        wdeg[ia] += float(sim); wdeg[ib] += float(sim)
        if float(sim) > maxsim[ia]: maxsim[ia] = float(sim)
        if float(sim) > maxsim[ib]: maxsim[ib] = float(sim)
        n_edges += 1

n_nodes = len(id2tok)
elapsed = time.time() - t0
p(f"Nodes={n_nodes:,} | Edges={n_edges:,} | time={elapsed:.2f}s")

coverage = (n_nodes / total_tokens * 100.0) if total_tokens else None
if coverage is not None:
    p(f"Token coverage vs meta.total_tokens: {coverage:.2f}%")

# --- 2) COMPONENTS ---
head("COMPUTE CONNECTED COMPONENTS")
t1 = time.time()
root_of = [dsu.find(i) for i in range(n_nodes)]
comp2nodes: Dict[int, List[int]] = defaultdict(list)
for i, r in enumerate(root_of):
    comp2nodes[r].append(i)
n_comps = len(comp2nodes)
sizes = [len(v) for v in comp2nodes.values()]
p(f"Components={n_comps:,} | mean size={np.mean(sizes):.2f} | median={np.median(sizes):.0f} | max={np.max(sizes):,}")
# Size buckets
bins = [(1,1),(2,2),(3,5),(6,10),(11,20),(21,50),(51,100),(101,99999999)]
bucket_counts = []
for lo,hi in bins:
    c = sum(1 for s in sizes if lo <= s <= hi)
    bucket_counts.append(((lo,hi), c))
p("Size buckets:")
for (lo,hi), c in bucket_counts:
    p(f"{lo:>3}-{hi:<3}: {c:,}")
p(f"time={time.time()-t1:.2f}s")

# --- 3) BUILD DATAFRAMES: token‚Üícluster, summary per cluster ---
head("BUILD CLUSTER TABLES")
t2 = time.time()
# cluster id: compact 0..C-1
roots_sorted = sorted(comp2nodes.keys(), key=lambda r: (-len(comp2nodes[r]), r))
root_to_cid = {r:i for i,r in enumerate(roots_sorted)}

rows_map = []
for r, nodes in comp2nodes.items():
    cid = root_to_cid[r]
    for i in nodes:
        s = id2tok[i]
        rows_map.append((
            s, cid, int(deg[i]), float(wdeg[i]),
            flag_short(s), flag_digit(s), flag_zz(s)
        ))

map_df = pd.DataFrame(rows_map, columns=["token","cluster_id","degree","wdegree","is_short","has_digit","starts_zz"])

# cluster summary
summ_rows = []
for r, nodes in comp2nodes.items():
    cid = root_to_cid[r]
    size = len(nodes)
    # medoid = max wdegree (why: most connected/central)
    med_i = max(nodes, key=lambda i: (wdeg[i], deg[i]))
    med_token = id2tok[med_i]
    med_deg = int(deg[med_i]); med_wdeg = float(wdeg[med_i])
    # flags
    short_rate = np.mean([flag_short(id2tok[i]) for i in nodes])
    digit_rate = np.mean([flag_digit(id2tok[i]) for i in nodes])
    zz_rate = np.mean([flag_zz(id2tok[i]) for i in nodes])
    mean_deg = float(np.mean([deg[i] for i in nodes]))
    mean_wdeg = float(np.mean([wdeg[i] for i in nodes]))
    summ_rows.append((
        cid, size, med_token, med_deg, med_wdeg, mean_deg, mean_wdeg, short_rate, digit_rate, zz_rate
    ))

sum_df = pd.DataFrame(
    summ_rows,
    columns=["cluster_id","size","medoid","medoid_deg","medoid_wdeg","mean_deg","mean_wdeg","short_rate","digit_rate","zz_rate"]
).sort_values(["size","medoid_wdeg"], ascending=[False, False])

p(f"map_df: {len(map_df):,} rows | clusters: {len(sum_df):,} | time={time.time()-t2:.2f}s")

# --- 4) PRINT EXCESSIVE DIAGNOSTICS ---
head("CLUSTER DIAGNOSTICS (PRINTS)")

# Top 20 largest clusters
topk = sum_df.head(20)
p("\nTop 20 clusters (size, medoid, deg, flags):")
p(topk[["cluster_id","size","medoid","medoid_deg","medoid_wdeg","short_rate","digit_rate","zz_rate"]].to_string(index=False))

# Problematic clusters by flags
flaggy = sum_df.query("short_rate>0.50 or digit_rate>0.50 or zz_rate>0.50").head(20)
p("\nTop flagged clusters (potential garbage):")
p(flaggy.to_string(index=False) if not flaggy.empty else "(none)")

# Cluster size histogram
hist_bins = [1,2,3,5,10,20,50,100,200,500,1000,999999]
hist_counts = {}
sizes_arr = sum_df["size"].to_numpy()
for i in range(len(hist_bins)-1):
    lo, hi = hist_bins[i], hist_bins[i+1]
    cnt = int(((sizes_arr >= lo) & (sizes_arr < hi)).sum())
    hist_counts[f"{lo}-{hi-1}"] = cnt
p("\nCluster size histogram:")
for k in hist_counts:
    p(f"{k}: {hist_counts[k]:,}")

# Show a few clusters with examples
def show_cluster(cid: int, n_tokens: int = 15):
    sub = map_df[map_df["cluster_id"] == cid].copy()
    sub.sort_values(["degree","wdegree"], ascending=False, inplace=True)
    p(f"\n[Cluster {cid}] size={len(sub):,}  medoid='{sum_df.loc[sum_df['cluster_id']==cid,'medoid'].values[0]}'")
    p(sub.head(n_tokens)[["token","degree","wdegree","is_short","has_digit","starts_zz"]].to_string(index=False))

p("\nExamples from a few largest clusters:")
for cid in topk["cluster_id"].head(5).tolist():
    show_cluster(cid)

# Sample edges for those clusters (for eyeballing)
head("EDGE SAMPLES FROM TOP CLUSTERS")
need_cids = set(topk["cluster_id"].head(5).tolist())
cid_of_token = dict(zip(map_df["token"], map_df["cluster_id"]))
samples = []
scan2 = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=250_000)
take_per_cid = 200  # cap prints
kept_per_cid = Counter()
for batch in scan2.to_batches():
    df = batch.to_pandas()
    for a, b, sim, r in zip(df["token_a"], df["token_b"], df["cosine_sim"], df["rank"]):
        ca = cid_of_token.get(a); cb = cid_of_token.get(b)
        if ca is None or cb is None or ca != cb or ca not in need_cids:
            continue
        if kept_per_cid[ca] >= take_per_cid:
            continue
        samples.append((ca, a, b, float(sim), int(r)))
        kept_per_cid[ca] += 1
        if all(kept_per_cid[c] >= take_per_cid for c in need_cids):
            break
    if all(kept_per_cid[c] >= take_per_cid for c in need_cids):
        break

edge_samples_df = pd.DataFrame(samples, columns=["cluster_id","token_a","token_b","cosine_sim","rank"])
p(edge_samples_df.head(40).to_string(index=False))

# Borderline per-cluster (0.50‚Äì0.55) to spot weak links
border = edge_samples_df[(edge_samples_df["cosine_sim"] >= 0.50) & (edge_samples_df["cosine_sim"] < 0.55)]
p(f"\nBorderline edges in sampled top clusters (0.50‚Äì0.55): {len(border):,}")
if not border.empty:
    p(border.sort_values(["cosine_sim"]).head(30).to_string(index=False))

# --- 5) SAVE ARTIFACTS ---
head("SAVE CLUSTER ARTIFACTS")
map_df.to_parquet(CLUSTERS_MAP, index=False)
sum_df.to_parquet(CLUSTERS_SUM, index=False)
edge_samples_df.to_parquet(CLUSTERS_EDGES_SAMPLE, index=False)
p(f"‚úÖ Saved token‚Üícluster map ‚Üí {CLUSTERS_MAP}")
p(f"‚úÖ Saved cluster summary ‚Üí {CLUSTERS_SUM}")
p(f"‚úÖ Saved edge samples ‚Üí {CLUSTERS_EDGES_SAMPLE}")

# --- 6) NEXT-STEP GUIDANCE (printed) ---
head("NEXT-STEP SUGGESTIONS (ACTIONABLE)")
p("- If flagged clusters dominate (short/digits/zz), add a **cluster-level gate**: drop clusters with flag_rate>0.7 or size<3 and high flag rates.")
p("- Promote medoid tokens as **canonical labels**; export token‚Üícanonical map (medoid).")
p("- If many borderline edges within large clusters, raise BASE_THRESHOLD to 0.55 and re-run filter.")
p("- If hubiness resurfaces in clusters, consider **per-length thresholds** (len‚â§4 ‚Üí sim‚â•0.85) and **token stoplist** (e.g., exact 'zz*').")
p("- Proceed to **merge clusters with high medoid similarity** (medoid‚Äìmedoid sim‚â•0.85) to collapse near-dup clusters.")

üîé INPUTS
romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.filtered.parquet | exists=True | size=19.02 MB
romance-novel-nlp-research/src/eda_analysis/outputs/cell3_similarity_metadata.json | exists=True
romance-novel-nlp-research/src/eda_analysis/outputs/cell3_filters_metadata.json | exists=True

Meta excerpt:
{
  "stats": {
    "total_tokens": 254778,
    "embedding_dim": 384,
    "candidate_pairs_found": 12502616
  },
  "config": {
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "batch_size": 1024,
    "normalize": true,
    "top_k_neighbors": 50,
    "similarity_threshold_initial": 0.3,
    "similarity_threshold_used": 0.44999992847442627,
    "hnsw": {
      "M": 16,
      "ef_construction": 200,
      "ef": 200
    },
    "query_batch_size": 25000,
    "tuning": {
      "enabled": true,
      "csv": null,
      "min_precision": 0.9,
      "min_recall": null,
      "grid": [
        0.05,
        0.95,
        0.01
      ],
      "metrics_

In [13]:
# romance-novel-nlp-research/src/eda_analysis/cell3d_refine_giant_cluster.py
from __future__ import annotations
import re, sys, time, json
from pathlib import Path
from typing import Dict, List, Set, Tuple
from collections import defaultdict, Counter, deque

import numpy as np
import pandas as pd

# ---------- Paths ----------
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
PAIR_F = OUT / "candidate_similarity_pairs.filtered.parquet"
MAP_F  = OUT / "clusters_token_map.parquet"
SUM_F  = OUT / "clusters_summary.parquet"

REF_MAP_F = OUT / "clusters_token_map.refined.parquet"
REF_SUM_F = OUT / "clusters_summary.refined.parquet"
REF_EDGES_F = OUT / "candidate_similarity_pairs.cluster0_refined.parquet"

# ---------- Config (tweak) ----------
BASE_SIM_STRONG = 0.60
BASE_SIM_WEAK   = 0.50
MIN_SHARED_WORDS = 1
SHORT_LEN_MAX = 3
SHORT_RF_MIN = 0.70

STOP_LANGS: List[str] = ["en"]  # e.g., ["en","es","de"]
STOP_BACKEND: str = "auto"      # "auto"|"iso"|"nltk"|"spacy"|"sklearn"

PRINT_EDGE_SAMPLES = 40
TOP_CLUSTERS_PRINT = 20
SUBCLUSTER_SAMPLE_TOKENS = 12

# ---------- Imports ----------
try:
    import pyarrow as pa
    import pyarrow.dataset as ds
except Exception as e:
    raise RuntimeError("Requires pyarrow. Install: pip install pyarrow") from e

try:
    from rapidfuzz.fuzz import ratio as fuzz_ratio
    HAS_RF = True
except Exception:
    HAS_RF = False

# ---------- Printing ----------
def p(x: str): print(x); sys.stdout.flush()
def head(t: str):
    sep = "="*120
    p(sep); p(f"üîé {t}"); p(sep)

# ---------- Stopwords loader ----------
_LANG_MAP_NLTK = {
    "ar":"arabic","da":"danish","nl":"dutch","en":"english","fi":"finnish","fr":"french","de":"german",
    "hu":"hungarian","it":"italian","kk":"kazakh","ne":"nepali","no":"norwegian","pt":"portuguese",
    "ro":"romanian","ru":"russian","sl":"slovene","es":"spanish","sv":"swedish","tr":"turkish"
}
def load_stopwords(langs: List[str], backend: str = "auto") -> Tuple[Set[str], str]:
    langs = [l.lower() for l in langs]
    tried = []
    # 1) stopwords-iso
    if backend in ("auto","iso"):
        try:
            from stopwordsiso import stopwords as sw_iso  # pip install stopwordsiso
            sw: Set[str] = set()
            for l in langs:
                try:
                    sw |= set(w.lower() for w in sw_iso(l))
                except Exception:
                    pass
            if sw:
                return sw, "stopwords-iso"
            tried.append("stopwords-iso(empty)")
        except Exception:
            tried.append("stopwords-iso(missing)")
            if backend == "iso":
                raise RuntimeError("Install stopwords-iso: pip install stopwordsiso")
    # 2) NLTK
    if backend in ("auto","nltk"):
        try:
            import nltk
            try:
                from nltk.corpus import stopwords as nltk_sw
                sw = set()
                for l in langs:
                    name = _LANG_MAP_NLTK.get(l, "english")
                    try:
                        sw |= set(w.lower() for w in nltk_sw.words(name))
                    except LookupError:
                        nltk.download("stopwords", quiet=True)
                        sw |= set(w.lower() for w in nltk_sw.words(name))
                if sw:
                    return sw, "nltk"
                tried.append("nltk(empty)")
            except Exception:
                tried.append("nltk(corpus err)")
        except Exception:
            tried.append("nltk(missing)")
            if backend == "nltk":
                raise RuntimeError("Install NLTK: pip install nltk")
    # 3) spaCy
    if backend in ("auto","spacy"):
        try:
            import spacy
            sw = set()
            for l in langs:
                try:
                    nlp = spacy.blank(l)
                    sw |= set(w.lower() for w in nlp.Defaults.stop_words)
                except Exception:
                    pass
            if sw:
                return sw, "spacy"
            tried.append("spacy(empty)")
        except Exception:
            tried.append("spacy(missing)")
            if backend == "spacy":
                raise RuntimeError("Install spaCy: pip install spacy")
    # 4) scikit-learn (English only)
    if backend in ("auto","sklearn"):
        try:
            from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
            return set(w.lower() for w in ENGLISH_STOP_WORDS), "sklearn"
        except Exception:
            tried.append("sklearn(missing)")
            if backend == "sklearn":
                raise RuntimeError("Install scikit-learn: pip install scikit-learn")
    # 5) Minimal fallback
    base = {"a","an","the","and","or","to","of","in","on","for","with","without","by","at","as","from",
            "this","that","these","those","it","its","is","are","be","was","were","been","am","i","you",
            "he","she","they","we","me","him","her","them","my","your","our","their","not","no","yes",
            "into","over","under","up","down","out","more","most","less","least","very"}
    p(f"‚ö†Ô∏è Using minimal fallback stopword list. Tried: {', '.join(tried)}")
    return base, "fallback"

TOKEN_RE = re.compile(r"[a-z]{2,}")  # keep simple Latin 2+ letters

def build_content_filter(langs: List[str], backend: str = "auto"):
    sw, used = load_stopwords(langs, backend)
    p(f"‚úÖ Stopwords backend: {used} | langs={langs} | count={len(sw)}")
    sw = set(sw)  # ensure set
    def content_words(s: str) -> Set[str]:
        # why: reduce semantic chaining with content word overlap
        words = TOKEN_RE.findall(s.lower())
        return {w for w in words if w not in sw}
    return content_words

# ---------- Short guard ----------
def short_guard(a: str, b: str) -> bool:
    if min(len(a), len(b)) > SHORT_LEN_MAX:
        return True
    if not HAS_RF:
        return True
    return (fuzz_ratio(a, b) / 100.0) >= SHORT_RF_MIN

# ---------- 0) Load inputs ----------
head("INPUTS")
for f in [PAIR_F, MAP_F, SUM_F]:
    p(f"{f} | exists={f.exists()} | size={(f.stat().st_size/1024/1024):.2f} MB" if f.exists() else f"{f} | MISSING")
if not (PAIR_F.exists() and MAP_F.exists() and SUM_F.exists()):
    raise FileNotFoundError("Required inputs missing. Run previous cells (filter + cluster).")

sum_df = pd.read_parquet(SUM_F)
map_df = pd.read_parquet(MAP_F)
p(f"clusters: {len(sum_df):,} | tokens mapped: {len(map_df):,}")

# ---------- 1) Select the largest cluster ----------
head("SELECT LARGEST CLUSTER")
top = sum_df.sort_values(["size","medoid_wdeg"], ascending=[False, False]).iloc[0]
giant_cid = int(top["cluster_id"]); giant_size = int(top["size"])
p(f"Picked cluster_id={giant_cid} | size={giant_size:,} | medoid='{top['medoid']}'")
giant_tokens = map_df.loc[map_df["cluster_id"] == giant_cid, "token"].tolist()
giant_set = set(giant_tokens)
p(f"Collected tokens for giant cluster: {len(giant_tokens):,}")

# ---------- 2) Stopwords-driven content words ----------
head("LOAD STOPWORDS & BUILD CONTENT WORDS")
content_words = build_content_filter(STOP_LANGS, STOP_BACKEND)

# ---------- 3) Precompute content-word sets ----------
head("BUILD CONTENT-WORD SETS (GIANT CLUSTER)")
t0 = time.time()
cw: Dict[str, Set[str]] = {}
for s in giant_tokens:
    cw[s] = content_words(s)
p(f"Built content sets: {len(cw):,} | time={time.time()-t0:.2f}s")

# ---------- 4) Stream edges and keep those that pass gates ----------
head("STREAM & FILTER EDGES WITH LEXICAL OVERLAP")
dsobj = ds.dataset(PAIR_F.as_posix(), format="parquet")
scanner = dsobj.scanner(columns=["token_a","token_b","cosine_sim","rank"], batch_size=250_000)

kept_edges: List[Tuple[str,str,float,int]] = []
dropped = Counter()
t1 = time.time()
rows = 0
for batch in scanner.to_batches():
    df = batch.to_pandas()
    m = df["token_a"].isin(giant_set) & df["token_b"].isin(giant_set)
    if not m.any():
        continue
    sub = df[m]
    for a, b, sim, r in zip(sub["token_a"], sub["token_b"], sub["cosine_sim"], sub["rank"]):
        rows += 1
        s = float(sim)
        if not short_guard(a, b):
            dropped["short_guard_fail"] += 1
            continue
        if s >= BASE_SIM_STRONG:
            kept_edges.append((a, b, s, int(r)))
        elif s >= BASE_SIM_WEAK:
            if len(cw[a] & cw[b]) >= MIN_SHARED_WORDS:
                kept_edges.append((a, b, s, int(r)))
            else:
                dropped["no_word_overlap"] += 1
        else:
            dropped["below_min_sim"] += 1

p(f"Scanned in-giant edges: {rows:,} | kept={len(kept_edges):,} | time={time.time()-t1:.2f}s")
p("Drop reasons:")
for k,v in dropped.items():
    p(f"- {k}: {v:,}")
if not kept_edges:
    raise RuntimeError("No edges retained; relax gates or check stopword config.")

# ---------- 5) Build subgraph + connected components ----------
head("BUILD SUBGRAPH & SUBCLUSTERS")
adj: Dict[str, List[str]] = defaultdict(list)
for a, b, s, r in kept_edges:
    adj[a].append(b); adj[b].append(a)

visited = set()
sub_components: List[List[str]] = []
for node in giant_tokens:
    if node in visited: continue
    if node not in adj:
        visited.add(node); sub_components.append([node]); continue
    q = deque([node]); comp = []
    visited.add(node)
    while q:
        u = q.popleft(); comp.append(u)
        for v in adj[u]:
            if v not in visited:
                visited.add(v); q.append(v)
    sub_components.append(comp)

sizes = sorted((len(c) for c in sub_components), reverse=True)
p(f"Subclusters={len(sub_components):,} | max={sizes[0]:,} | median={int(np.median(sizes))} | mean={np.mean(sizes):.2f}")

# ---------- 6) Medoids ----------
deg = {t: len(adj.get(t, [])) for t in giant_tokens}
sim_sum = defaultdict(float); sim_cnt = defaultdict(int)
for a, b, s, _ in kept_edges:
    sim_sum[a] += s; sim_cnt[a] += 1
    sim_sum[b] += s; sim_cnt[b] += 1
def medoid_of(nodes: List[str]) -> Tuple[str, int, float]:
    best, best_score = None, (-1, -1.0)
    for t in nodes:
        d = deg.get(t, 0)
        w = (sim_sum[t]/sim_cnt[t]) if sim_cnt[t] else 0.0
        if (d, w) > best_score:
            best_score = (d, w); best = t
    return best, best_score[0], best_score[1]

sub_infos = []
for idx, nodes in enumerate(sub_components):
    m_tok, m_deg, m_w = medoid_of(nodes)
    sub_infos.append((idx, len(nodes), m_tok, m_deg, m_w))
sub_sum_df = pd.DataFrame(sub_infos, columns=["sub_id","size","medoid","medoid_deg","medoid_mean_sim"])\
                .sort_values(["size","medoid_deg","medoid_mean_sim"], ascending=[False, False, False])
p("\nTop 15 subclusters:")
p(sub_sum_df.head(15).to_string(index=False))

# ---------- 7) Edge samples ----------
head("EDGE SAMPLES FROM BIG SUBCLUSTERS")
need_sub = set(sub_sum_df.head(5)["sub_id"].tolist())
node_to_sub = {}
for sub_id, nodes in enumerate(sub_components):
    for t in nodes:
        node_to_sub[t] = sub_id

samples = []
quota = Counter()
for a, b, s, r in kept_edges:
    sa = node_to_sub.get(a); sb = node_to_sub.get(b)
    if sa is None or sa != sb or sa not in need_sub: continue
    if quota[sa] >= PRINT_EDGE_SAMPLES: continue
    samples.append((sa, a, b, float(s), int(r)))
    quota[sa] += 1
samples_df = pd.DataFrame(samples, columns=["sub_id","token_a","token_b","cosine_sim","rank"])
p(samples_df.head(60).to_string(index=False))

# ---------- 8) Reassign cluster IDs (refine giant only) ----------
head("REASSIGN CLUSTER IDS (REFINE GIANT ONLY)")
ref_map = map_df.copy()
mask = ref_map["cluster_id"] == giant_cid
ref_map.loc[mask, "cluster_id"] = -1

BASE = int(giant_cid) * 1_000_000
token_to_new = {}
for sub_id, nodes in enumerate(sub_components):
    new_cid = BASE + sub_id
    for t in nodes:
        token_to_new[t] = new_cid
ref_map.loc[mask, "cluster_id"] = ref_map.loc[mask, "token"].map(token_to_new).astype(np.int64)

ref_sum = sum_df.copy()
ref_sum = ref_sum[ref_sum["cluster_id"] != giant_cid]
rows = []
for sub_id, nodes in enumerate(sub_components):
    med, mdeg, mw = medoid_of(nodes)
    rows.append({"cluster_id": BASE + sub_id, "size": len(nodes), "medoid": med,
                 "medoid_deg": mdeg, "medoid_wdeg": mw,
                 "mean_deg": float(np.mean([deg.get(t,0) for t in nodes])),
                 "mean_wdeg": float(np.mean([(sim_sum[t]/sim_cnt[t]) if sim_cnt[t] else 0.0 for t in nodes])),
                 "short_rate": float(np.mean([len(t)<=3 for t in nodes])),
                 "digit_rate": float(np.mean([any(c.isdigit() for c in t) for t in nodes])),
                 "zz_rate": float(np.mean([t.strip().lower().startswith('zz') for t in nodes]))})
ref_sum = pd.concat([ref_sum, pd.DataFrame(rows)], ignore_index=True)\
             .sort_values(["size","medoid_wdeg"], ascending=[False, False])

p(f"Refined clusters: {len(ref_sum):,} (giant ‚Üí {len(sub_components):,} subclusters)")
p("\nRefined top 15 clusters:")
p(ref_sum.head(15).to_string(index=False))

# ---------- 9) Save artifacts ----------
head("SAVE ARTIFACTS")
ref_map.to_parquet(REF_MAP_F, index=False)
ref_sum.to_parquet(REF_SUM_F, index=False)
pd.DataFrame(kept_edges, columns=["token_a","token_b","cosine_sim","rank"]).to_parquet(REF_EDGES_F, index=False)
p(f"‚úÖ Saved refined token‚Üícluster map ‚Üí {REF_MAP_F}")
p(f"‚úÖ Saved refined cluster summary ‚Üí {REF_SUM_F}")
p(f"‚úÖ Saved refined giant-cluster edges ‚Üí {REF_EDGES_F}")

# ---------- 10) Next-step guidance ----------
head("NEXT STEP GUIDANCE")
p("- If overlap is too strict, reduce MIN_SHARED_WORDS or lower BASE_SIM_STRONG (e.g., 0.58).")
p("- For multilingual data, set STOP_LANGS=['en','es','de',...] to broaden stopword removal.")
p("- Export canonical map (token ‚Üí medoid) and apply to shelves next.")

üîé INPUTS
romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.filtered.parquet | exists=True | size=19.02 MB
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.parquet | exists=True | size=4.22 MB
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.parquet | exists=True | size=0.16 MB
clusters: 4,223 | tokens mapped: 232,918
üîé SELECT LARGEST CLUSTER
Picked cluster_id=0 | size=218,379 | medoid='manga graphic novels comics'
Collected tokens for giant cluster: 218,379
üîé LOAD STOPWORDS & BUILD CONTENT WORDS
‚úÖ Stopwords backend: sklearn | langs=['en'] | count=318
üîé BUILD CONTENT-WORD SETS (GIANT CLUSTER)
Built content sets: 218,379 | time=1.70s
üîé STREAM & FILTER EDGES WITH LEXICAL OVERLAP
Scanned in-giant edges: 1,155,628 | kept=1,135,066 | time=9.67s
Drop reasons:
- short_guard_fail: 1,946
- no_word_overlap: 18,616
üîé BUILD SUBGRAPH & SUBCLUSTERS
Subclusters=2,797 | max=213,904 | median=1 | mean=78.08

Top 15

In [14]:
# romance-novel-nlp-research/src/eda_analysis/cell3e_explore_refined_outputs.py
# Heavy-print audit of refined clustering to pick the next feature-engineering gates.

from __future__ import annotations
import sys, re, json, math, time
from pathlib import Path
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

import numpy as np
import pandas as pd

# ---------------------------------- Paths ----------------------------------
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
MAP_F  = OUT / "clusters_token_map.refined.parquet"         # from cell3d
SUM_F  = OUT / "clusters_summary.refined.parquet"           # from cell3d
EDGES_F = OUT / "candidate_similarity_pairs.cluster0_refined.parquet"  # giant-cluster edges

# -------------------------------- Printing ---------------------------------
def p(x: str) -> None:
    print(x); sys.stdout.flush()

def head(title: str) -> None:
    sep = "=" * 120
    p(sep); p(f"üîé {title}"); p(sep)

# ------------------------------ Stopwords utils ----------------------------
# Why: content-words reduce semantic chaining. Keep portable; no downloads required.
def load_stopwords() -> Set[str]:
    try:
        from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
        sw = set(w.lower() for w in ENGLISH_STOP_WORDS)
        p(f"‚úÖ stopwords: sklearn ({len(sw)})")
        return sw
    except Exception:
        base = {"a","an","the","and","or","to","of","in","on","for","with","by","as","from",
                "this","that","these","those","it","its","is","are","be","was","were","been",
                "i","you","we","he","she","they","me","him","her","them","my","your","our","their",
                "not","no","yes","into","over","under","up","down","out","more","most","less","very"}
        p(f"‚ö†Ô∏è stopwords: fallback ({len(base)})")
        return base

TOKEN_RE = re.compile(r"[a-z]{2,}")  # why: stable content tokens

def content_words(s: str, sw: Set[str]) -> Set[str]:
    return {w for w in TOKEN_RE.findall(s.lower()) if w not in sw}

def digit_ratio(s: str) -> float:
    if not s: return 0.0
    digits = sum(c.isdigit() for c in s)
    return digits / len(s)

def token_shape(s: str) -> str:
    # why: spot numeric/symbolic pseudo-tags
    out = []
    for c in s:
        if c.isalpha(): out.append('A')
        elif c.isdigit(): out.append('D')
        elif c.isspace(): out.append('_')
        else: out.append('#')
    return ''.join(out)

# ----------------------------- Optional RapidFuzz ---------------------------
try:
    from rapidfuzz.fuzz import ratio as fuzz_ratio
    HAS_RF = True
    p("‚úÖ RapidFuzz available")
except Exception:
    HAS_RF = False
    p("‚ÑπÔ∏è RapidFuzz not available (skip char-sim prints)")

# --------------------------------- Load ------------------------------------
head("INPUTS")
for f in [MAP_F, SUM_F, EDGES_F]:
    p(f"{f} | exists={f.exists()} | size={(f.stat().st_size/1024/1024):.2f} MB" if f.exists() else f"{f} | MISSING")
if not (MAP_F.exists() and SUM_F.exists() and EDGES_F.exists()):
    raise FileNotFoundError("Missing refined artifacts. Run cell3d_refine_giant_cluster.py first.")

map_df = pd.read_parquet(MAP_F)             # token, cluster_id, ...
sum_df = pd.read_parquet(SUM_F)             # cluster summary
edges = pd.read_parquet(EDGES_F)            # token_a, token_b, cosine_sim, rank

# Identify giant cluster id base
giant_row = sum_df.sort_values(["size","medoid_wdeg"], ascending=[False, False]).iloc[0]
giant_cid = int(giant_row["cluster_id"])
p(f"\nGiant cluster id (refined namespace): {giant_cid} | size={int(giant_row['size']):,}")

giant_tokens = set(map_df.loc[map_df["cluster_id"] == giant_cid, "token"].tolist())
p(f"Giant tokens mapped: {len(giant_tokens):,}")
p(f"Giant edges (refined file rows): {len(edges):,}")

# --------------------------------- Stats 1 ---------------------------------
head("EDGE & TOKEN STATS (GIANT CLUSTER)")
p(edges["cosine_sim"].describe(percentiles=[.5,.75,.9,.95,.99]).to_string())

# token shapes / digit share
toks = list(giant_tokens)
dr = np.array([digit_ratio(t) for t in toks], dtype=np.float32)
p(f"\nDigit ratio (tokens): mean={dr.mean():.3f} | p90={np.quantile(dr,0.90):.2f} | p99={np.quantile(dr,0.99):.2f}")
shape_counts = Counter(token_shape(t)[:12] for t in toks)  # prefix to compact
p("Top token shape prefixes:")
for shp, cnt in shape_counts.most_common(20):
    p(f"- {shp!r}: {cnt:,}")

# --------------------------------- Stats 2 ---------------------------------
head("CONTENT WORDS + OVERLAP STATS")
SW = load_stopwords()
# Precompute cw for all tokens appearing in edges (reduce memory)
nodes_in_edges = set(edges["token_a"]).union(set(edges["token_b"]))
cw: Dict[str, Set[str]] = {}
for s in nodes_in_edges:
    cw[s] = content_words(s, SW)

# Compute overlap, Jaccard, and (optional) RapidFuzz on a large sample
sample_n = min(250_000, len(edges))
sample = edges.sample(n=sample_n, random_state=7).copy()
sample["_cw_a"] = [cw[a] for a in sample["token_a"]]
sample["_cw_b"] = [cw[b] for b in sample["token_b"]]
sample["_shared"] = [len(a & b) for a,b in zip(sample["_cw_a"], sample["_cw_b"])]
sample["_union"]  = [max(1, len(a | b)) for a,b in zip(sample["_cw_a"], sample["_cw_b"])]
sample["_jacc"]   = sample["_shared"] / sample["_union"]
if HAS_RF:
    sample["_rf"] = [fuzz_ratio(a,b)/100.0 for a,b in zip(sample["token_a"], sample["token_b"])]

p(f"Sampled edges: {sample_n:,}")
p("\nShared-word counts (sample):")
p(sample["_shared"].describe(percentiles=[.5,.75,.9,.95,.99]).to_string())
p("\nJaccard (cw) (sample):")
p(sample["_jacc"].describe(percentiles=[.5,.75,.9,.95,.99]).to_string())
if HAS_RF:
    p("\nRapidFuzz char-sim (sample):")
    p(sample["_rf"].describe(percentiles=[.5,.75,.9,.95,.99]).to_string())

# Buckets
def hist(series: pd.Series, bins: List[float]) -> List[Tuple[str,int]]:
    arr = series.to_numpy(np.float32, copy=False)
    out = []
    for i in range(len(bins)-1):
        lo, hi = bins[i], bins[i+1]
        cnt = int(((arr >= lo) & (arr < hi)).sum())
        out.append((f"[{lo:.2f},{hi:.2f})", cnt))
    return out

p("\nJaccard histogram:")
for rng, cnt in hist(sample["_jacc"], [0, .05, .10, .15, .20, .30, .40, .50, 1.01]):
    p(f"{rng}: {cnt:,}")

p("\nShared-word histogram:")
for rng, cnt in hist(sample["_shared"].astype(np.float32), [0,1,2,3,4,5,10,1000]):
    p(f"{rng}: {cnt:,}")

if HAS_RF:
    p("\nLow-char but high-sim examples (rf<0.5 & cos>=0.8) (up to 30):")
    zz = sample[(sample["_rf"]<0.5) & (sample["cosine_sim"]>=0.8)].head(30)
    p(zz[["token_a","token_b","cosine_sim","_shared","_jacc","_rf"]].to_string(index=False))

# -------------------------------- What-if gates ----------------------------
head("WHAT-IF EDGE RETENTION (SAMPLE)")
def keep_edge(sim: float, shared: int, jacc: float) -> Dict[str,bool]:
    return {
        "A_strict": (sim >= 0.65) or (sim >= 0.55 and shared >= 2) or (sim >= 0.50 and shared >= 3),
        "B_jacc":   (jacc >= 0.20) and (sim >= 0.55),
        "C_tight":  (sim >= 0.70) or (sim >= 0.60 and shared >= 2)
    }

ret = {"A_strict":0, "B_jacc":0, "C_tight":0}
for sim, sh, j in zip(sample["cosine_sim"], sample["_shared"], sample["_jacc"]):
    d = keep_edge(float(sim), int(sh), float(j))
    for k in ret: ret[k] += int(d[k])

p(f"Total sample: {sample_n:,}")
for k,v in ret.items():
    p(f"- {k}: keep {v:,} ({v/sample_n*100:.1f}%)")

# -------------------------------- Bridges ----------------------------------
head("BRIDGING TOKEN CANDIDATES")
# Why: identify nodes that connect many disparate word themes (likely generic/noisy).
adj = defaultdict(list)
for a,b,sim in zip(edges["token_a"], edges["token_b"], edges["cosine_sim"]):
    adj[a].append(b); adj[b].append(a)

def entropy(counts: Counter) -> float:
    total = sum(counts.values()) or 1
    H = 0.0
    for c in counts.values():
        p = c/total
        H -= p*math.log(p+1e-12)
    return H

bridge_rows = []
t0 = time.time()
for t, neighs in adj.items():
    if len(neighs) < 10:  # focus on sufficiently connected
        continue
    wc = Counter()
    for n in neighs[:200]:  # cap for speed
        wc.update(cw.get(n) or content_words(n, SW))
    H = entropy(wc)
    uniq = len(wc)
    topk = sum(c for _, c in wc.most_common(5))
    bridge_rows.append((t, len(neighs), uniq, H, topk))
p(f"Scanned {len(bridge_rows):,} candidates in {time.time()-t0:.2f}s")

bridge_df = pd.DataFrame(bridge_rows, columns=["token","degree","neighbor_word_uniq","entropy","top5_word_hits"])\
             .sort_values(["entropy","neighbor_word_uniq","degree"], ascending=[False, False, False])
p("\nTop 30 bridge-like tokens:")
p(bridge_df.head(30).to_string(index=False))

# --------------------------- Offender token shapes -------------------------
head("TOKEN SHAPE / DIGIT OFFENDERS")
tok_stats = []
for t in list(giant_tokens)[:300000]:
    dr = digit_ratio(t)
    shp = token_shape(t)
    tok_stats.append((t, dr, shp))
tok_df = pd.DataFrame(tok_stats, columns=["token","digit_ratio","shape"])
off = tok_df[tok_df["digit_ratio"] >= 0.50].sort_values("digit_ratio", ascending=False)
p(f"Tokens with digit_ratio>=0.50: {len(off):,} (show 30)")
p(off.head(30).to_string(index=False))

shape_top = tok_df["shape"].value_counts().head(20)
p("\nTop 20 shapes:")
p(shape_top.to_string())

# --------------------------------- Guidance --------------------------------
head("ACTIONABLE NEXT-STEP SUGGESTIONS (DERIVED)")
# Numbers for quick decision
j_low = int((sample["_jacc"] < 0.10).mean()*100)
j_mid = int((sample["_jacc"] < 0.20).mean()*100)
sw_ge2 = int((sample["_shared"] >= 2).mean()*100)
p(f"- ~{j_low}% of sampled edges have Jaccard<0.10; ~{j_mid}% <0.20; ~{sw_ge2}% have ‚â•2 shared words.")
p("- If many edges rely on zero/one shared word, require ‚â•2 shared words for sim<0.60.")
p("- If 'A_strict' retains ‚â§~60% of sample but clusters remain coherent, adopt it. Otherwise try 'C_tight'.")
p("- Add digit/shape guard: drop tokens with digit_ratio‚â•0.5 unless paired with ‚â•2 shared words.")
p("- Build a small stoplist from top bridge tokens (above) if they are meta/noise tags.")
p("- Re-run filter with: BASE_SIM_STRONG in [0.65,0.70], MIN_SHARED_WORDS in [2,3], plus digit/shape gates.")


‚úÖ RapidFuzz available
üîé INPUTS
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.refined.parquet | exists=True | size=4.24 MB
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.refined.parquet | exists=True | size=0.22 MB
romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.cluster0_refined.parquet | exists=True | size=17.83 MB

Giant cluster id (refined namespace): 0 | size=213,904
Giant tokens mapped: 213,904
Giant edges (refined file rows): 1,135,066
üîé EDGE & TOKEN STATS (GIANT CLUSTER)
count    1.135066e+06
mean     8.118041e-01
std      9.405289e-02
min      5.005934e-01
50%      8.194881e-01
75%      8.861359e-01
90%      9.316543e-01
95%      9.531078e-01
99%      9.795010e-01
max      9.997207e-01

Digit ratio (tokens): mean=0.032 | p90=0.12 | p99=0.50
Top token shape prefixes:
- 'AAAAAAA_AAAA': 14,429
- 'AAAAAA_AAAAA': 12,950
- 'AAAAAAAA_AAA': 10,149
- 'AAAAA_AAAAAA': 9,805
- 'AAAAAAAAA_AA': 8,009
- 'AAAA

In [15]:
# romance-novel-nlp-research/src/eda_analysis/cell3f_export_canonical_map.py
from __future__ import annotations
import sys
from pathlib import Path
import pandas as pd
import numpy as np

OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
# Prefer refined artifacts
MAP_REFINED  = OUT / "clusters_token_map.refined.parquet"
SUM_REFINED  = OUT / "clusters_summary.refined.parquet"
# Fallback (pre-refine)
MAP_BASE     = OUT / "clusters_token_map.parquet"
SUM_BASE     = OUT / "clusters_summary.parquet"

# Outputs
CANON_PARQUET = OUT / "token_canonical_map.parquet"
CANON_CSV     = OUT / "token_canonical_map.csv"

def p(x: str) -> None:
    print(x); sys.stdout.flush()

def pick_sources() -> tuple[Path, Path, str]:
    """Why: use refined if present; else fallback and warn."""
    if MAP_REFINED.exists() and SUM_REFINED.exists():
        return MAP_REFINED, SUM_REFINED, "refined"
    if MAP_BASE.exists() and SUM_BASE.exists():
        p("‚ö†Ô∏è Refined artifacts not found. Falling back to base clustering outputs.")
        return MAP_BASE, SUM_BASE, "base"
    raise FileNotFoundError("Neither refined nor base cluster artifacts are available.")

def compute_fallback_medoids(map_df: pd.DataFrame) -> pd.DataFrame:
    """Why: ensure medoid for all clusters; choose token with max (degree, wdegree)."""
    need_cols = {"cluster_id","token"}
    if "degree" in map_df.columns: need_cols.add("degree")
    else: map_df["degree"] = 0
    if "wdegree" in map_df.columns: need_cols.add("wdegree")
    else: map_df["wdegree"] = 0.0
    g = map_df[list(need_cols)].copy()
    # rank: higher degree, then higher wdegree
    g["_rk"] = g.groupby("cluster_id").apply(
        lambda d: (-d["degree"].to_numpy(), -d["wdegree"].to_numpy())
    ).reset_index(level=0, drop=True)
    # Pandas can't sort by tuple directly across groups; do argsort per group
    def top1(df: pd.DataFrame) -> pd.Series:
        idx = np.lexsort(( -df["wdegree"].to_numpy(), -df["degree"].to_numpy() ))
        # np.lexsort sorts by last key first; we negated to make it descending.
        # Take first index of sorted order
        return df.iloc[idx[0]][["cluster_id","token","degree","wdegree"]]
    top = g.groupby("cluster_id", sort=False).apply(top1).reset_index(drop=True)
    top.rename(columns={"token":"medoid"}, inplace=True)
    return top[["cluster_id","medoid"]]

def main() -> None:
    map_p, sum_p, mode = pick_sources()
    p("============================================================")
    p(f"üîé SOURCES  ({mode})")
    p("============================================================")
    p(f"token map : {map_p} | exists={map_p.exists()}")
    p(f"summary   : {sum_p} | exists={sum_p.exists()}")

    map_df = pd.read_parquet(map_p)  # expected: token, cluster_id, (degree,wdegree,flags‚Ä¶)
    sum_df = pd.read_parquet(sum_p)  # expected: cluster_id, size, medoid, ‚Ä¶

    p(f"\nRows: token_map={len(map_df):,} | clusters={len(sum_df):,}")
    missing_medoid = sum_df["medoid"].isna().sum() if "medoid" in sum_df.columns else len(sum_df)
    if missing_medoid:
        p(f"‚ö†Ô∏è {missing_medoid:,} clusters missing medoid in summary ‚Üí computing fallbacks.")
        medoids_fallback = compute_fallback_medoids(map_df)
        sum_df = sum_df.merge(medoids_fallback, on="cluster_id", how="left", suffixes=("","_fallback"))
        sum_df["medoid"] = sum_df["medoid"].fillna(sum_df["medoid_fallback"])
        sum_df.drop(columns=[c for c in sum_df.columns if c.endswith("_fallback")], inplace=True)

    medoids = sum_df[["cluster_id","medoid"]].dropna().copy()
    medoids["cluster_id"] = medoids["cluster_id"].astype("int64")

    # Build canonical map
    keep_cols = ["token","cluster_id"]
    extras = [c for c in ["degree","wdegree","is_short","has_digit","starts_zz"] if c in map_df.columns]
    df = map_df[keep_cols + extras].copy()
    df["cluster_id"] = df["cluster_id"].astype("int64")
    df = df.merge(medoids, on="cluster_id", how="left")
    df.rename(columns={"medoid":"canonical_label"}, inplace=True)

    # Sanity
    no_cano = int(df["canonical_label"].isna().sum())
    if no_cano:
        p(f"‚ö†Ô∏è {no_cano:,} tokens missing canonical_label after join (will drop).")
        df = df.dropna(subset=["canonical_label"]).reset_index(drop=True)

    # Save artifacts
    p("\n============================================================")
    p("üíæ SAVING CANONICAL MAP")
    p("============================================================")
    df_out = df[["token","cluster_id","canonical_label"] + extras].copy()
    df_out.to_parquet(CANON_PARQUET, index=False)
    df_out.to_csv(CANON_CSV, index=False)
    p(f"‚úÖ Parquet ‚Üí {CANON_PARQUET}")
    p(f"‚úÖ CSV     ‚Üí {CANON_CSV}")

    # Coverage & quick stats
    n_tokens = df_out["token"].nunique()
    n_clusters = df_out["cluster_id"].nunique()
    n_cano = df_out["canonical_label"].nunique()
    self_maps = int((df_out["token"] == df_out["canonical_label"]).sum())
    p("\n============================================================")
    p("üìà SUMMARY")
    p("============================================================")
    p(f"Tokens mapped      : {n_tokens:,}")
    p(f"Clusters covered   : {n_clusters:,}")
    p(f"Canonical labels   : {n_cano:,}")
    p(f"Identity mappings  : {self_maps:,} ({self_maps/max(1,n_tokens):.2%})")

    # Top canonical labels by cluster size (from summary if available)
    if "size" in sum_df.columns:
        top = sum_df.sort_values(["size","medoid_wdeg" if "medoid_wdeg" in sum_df.columns else "size"], ascending=[False, False]).head(10)
        p("\nTop 10 canonical labels by cluster size:")
        p(top[["cluster_id","medoid","size"]].to_string(index=False))

    # 200 random rewrites where token != canonical_label
    p("\n============================================================")
    p("üîç SAMPLE REWRITES (200)  ‚Äî token ‚Üí canonical_label")
    p("============================================================")
    rng = np.random.default_rng(seed=42)
    diff = df_out[df_out["token"] != df_out["canonical_label"]]
    sample_n = min(200, len(diff)) if len(diff) else min(200, len(df_out))
    sample = diff.sample(sample_n, random_state=42) if len(diff) else df_out.sample(sample_n, random_state=42)
    show_cols = ["token","canonical_label","cluster_id"] + [c for c in ["degree","wdegree"] if c in sample.columns]
    # Order for readability
    sample = sample[show_cols].sort_values(["cluster_id","canonical_label","token"]).reset_index(drop=True)
    print(sample.to_string(index=False, max_colwidth=80))

if __name__ == "__main__":
    main()


üîé SOURCES  (refined)
token map : romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.refined.parquet | exists=True
summary   : romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.refined.parquet | exists=True

Rows: token_map=232,918 | clusters=7,019

üíæ SAVING CANONICAL MAP
‚úÖ Parquet ‚Üí romance-novel-nlp-research/src/eda_analysis/outputs/token_canonical_map.parquet
‚úÖ CSV     ‚Üí romance-novel-nlp-research/src/eda_analysis/outputs/token_canonical_map.csv

üìà SUMMARY
Tokens mapped      : 232,918
Clusters covered   : 4,223
Canonical labels   : 7,019
Identity mappings  : 7,019 (3.01%)

Top 10 canonical labels by cluster size:
 cluster_id                         medoid   size
          0    manga graphic novels comics 213904
          1               150 to 200 pages    328
          2      meet n greet 2015 dec jan    140
          3           erotica bdsm romance     96
          4              01 june utc bonus     77
          6 2015 read

In [16]:
# romance-novel-nlp-research/src/eda_analysis/cell3g_split_giant_community_detection.py
"""
Split the giant cluster via community detection on a filtered, weighted subgraph.
Re-export canonical labels (medoids) and print 200 random rewrites.
"""

from __future__ import annotations
import sys, re, time, math
from pathlib import Path
from typing import Dict, Set, List, Tuple
from collections import defaultdict, Counter

import numpy as np
import pandas as pd

# ---------------- Paths ----------------
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
MAP_REF   = OUT / "clusters_token_map.refined.parquet"
SUM_REF   = OUT / "clusters_summary.refined.parquet"
EDGES_G0  = OUT / "candidate_similarity_pairs.cluster0_refined.parquet"  # edges within giant cluster

# Outputs
MAP_COM   = OUT / "clusters_token_map.community.parquet"
SUM_COM   = OUT / "clusters_summary.community.parquet"
CANON_PQ  = OUT / "token_canonical_map.community.parquet"
CANON_CSV = OUT / "token_canonical_map.community.csv"
EDGES_USED= OUT / "giant_comm_edges_used.parquet"

# --------------- Config ----------------
SIM_STRONG: float = 0.65
SIM_WEAK: float   = 0.55
MIN_SHARED_WEAK: int = 2
WEIGHT_ALPHA: float = 0.5          # weight = sim * (1 + alpha * jaccard)
SHORT_LEN_MAX: int = 3
SHORT_RF_MIN: float = 0.70
PRINT_SUBCOMM: int = 15
REWRITE_SAMPLE_N: int = 200

# --------------- Imports ---------------
try:
    import networkx as nx
except Exception as e:
    raise RuntimeError("Requires networkx. Install: pip install networkx") from e

try:
    from rapidfuzz.fuzz import ratio as fuzz_ratio
    HAS_RF = True
except Exception:
    HAS_RF = False

try:
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    STOP = set(w.lower() for w in ENGLISH_STOP_WORDS)
except Exception as e:
    STOP = {"a","an","the","and","or","to","of","in","on","for","with","by","as","from",
            "this","that","these","those","it","its","is","are","be","was","were","been",
            "i","you","we","he","she","they","me","him","her","them","my","your","our","their",
            "not","no","yes","into","over","under","up","down","out","more","most","less","very"}

TOKEN_RE = re.compile(r"[a-z]{2,}")

# ------------- Helpers -----------------
def p(x: str) -> None:
    print(x); sys.stdout.flush()

def content_words(s: str) -> Set[str]:
    return {w for w in TOKEN_RE.findall(s.lower()) if w not in STOP}

def short_guard(a: str, b: str) -> bool:
    if min(len(a), len(b)) > SHORT_LEN_MAX:
        return True
    if not HAS_RF:
        return True
    return (fuzz_ratio(a, b) / 100.0) >= SHORT_RF_MIN

def medoid_of(nodes: List[str], adj: Dict[str, List[Tuple[str,float]]]) -> Tuple[str, int, float]:
    best, best_score = None, (-1, -1.0)
    for t in nodes:
        nbrs = adj.get(t, [])
        d = len(nbrs)
        w = (sum(s for _, s in nbrs) / d) if d else 0.0
        if (d, w) > best_score:
            best_score = (d, w); best = t
    return best, best_score[0], best_score[1]

# --------------- Main ------------------
def main() -> None:
    # Load inputs
    if not (MAP_REF.exists() and SUM_REF.exists() and EDGES_G0.exists()):
        raise FileNotFoundError("Missing refined artifacts. Ensure previous cells finished.")
    map_df = pd.read_parquet(MAP_REF)    # token, cluster_id, degree, wdegree, flags
    sum_df = pd.read_parquet(SUM_REF)    # cluster_id, size, medoid, ...
    edges0 = pd.read_parquet(EDGES_G0)   # token_a, token_b, cosine_sim, rank

    # Identify giant cluster id (largest row in refined summary)
    giant_row = sum_df.sort_values(["size","medoid_wdeg" if "medoid_wdeg" in sum_df.columns else "size"],
                                   ascending=[False, False]).iloc[0]
    giant_cid = int(giant_row["cluster_id"])
    g_tokens = set(map_df.loc[map_df["cluster_id"] == giant_cid, "token"])
    p("============================================================")
    p("üîé INPUTS")
    p("============================================================")
    p(f"giant cluster_id={giant_cid} | size={len(g_tokens):,}")
    p(f"giant edges file: {EDGES_G0} | rows={len(edges0):,}")

    # Precompute content words
    t0 = time.time()
    all_nodes = set(edges0["token_a"]).union(set(edges0["token_b"]))
    cw = {s: content_words(s) for s in all_nodes}
    p(f"Content-word sets: {len(cw):,} | time={time.time()-t0:.2f}s")

    # Filter edges + compute weights
    kept = []
    dropped = Counter()
    t1 = time.time()
    for a, b, sim in zip(edges0["token_a"], edges0["token_b"], edges0["cosine_sim"]):
        if not short_guard(a, b):
            dropped["short_guard_fail"] += 1
            continue
        A, B = cw[a], cw[b]
        shared = len(A & B)
        union = max(1, len(A | B))
        jacc = shared / union
        s = float(sim)
        keep = (s >= SIM_STRONG) or (s >= SIM_WEAK and shared >= MIN_SHARED_WEAK)
        if not keep:
            dropped["gate_fail"] += 1
            continue
        w = s * (1.0 + WEIGHT_ALPHA * jacc)
        kept.append((a, b, s, shared, jacc, w))
    p("============================================================")
    p("üîé EDGE FILTER SUMMARY (GIANT)")
    p("============================================================")
    p(f"Kept edges: {len(kept):,} / {len(edges0):,}  | time={time.time()-t1:.2f}s")
    for k,v in dropped.items():
        p(f"- {k}: {v:,}")

    if not kept:
        raise RuntimeError("No edges kept; relax thresholds.")

    # Build weighted graph
    G = nx.Graph()
    G.add_nodes_from(g_tokens)
    for a, b, s, shared, jacc, w in kept:
        G.add_edge(a, b, weight=w, sim=s, shared=shared, jacc=jacc)
    p(f"Graph nodes={G.number_of_nodes():,} | edges={G.number_of_edges():,}")

    # Community detection (asynchronous label propagation, weighted)
    t2 = time.time()
    comms = list(nx.algorithms.community.asyn_lpa_communities(G, weight="weight", seed=42))
    p("============================================================")
    p("üîé COMMUNITY DETECTION")
    p("============================================================")
    p(f"Communities: {len(comms):,} | time={time.time()-t2:.2f}s")

    sizes = sorted([len(c) for c in comms], reverse=True)
    p("Size histogram:")
    bins = [1,2,3,5,10,20,50,100,200,500,1000,999999]
    for i in range(len(bins)-1):
        lo, hi = bins[i], bins[i+1]-1
        cnt = sum(1 for s in sizes if lo <= s <= hi)
        p(f"{lo}-{hi}: {cnt:,}")
    p(f"max={sizes[0]:,} | median={int(np.median(sizes))} | mean={np.mean(sizes):.2f}")

    # Build adjacency for medoids
    adj: Dict[str, List[Tuple[str,float]]] = defaultdict(list)
    for a, b, s, shared, jacc, w in kept:
        adj[a].append((b, s)); adj[b].append((a, s))

    # New IDs for giant subcommunities (stable)
    BASE = giant_cid * 1_000_000
    rows_sum = []
    token_to_new = {}
    for sub_id, nodes in enumerate(comms):
        nodes_list = list(nodes)
        med, mdeg, mw = medoid_of(nodes_list, adj)
        new_cid = BASE + sub_id
        for t in nodes_list:
            token_to_new[t] = new_cid
        rows_sum.append({
            "cluster_id": new_cid, "size": len(nodes_list), "medoid": med,
            "medoid_deg": int(mdeg), "medoid_wdeg": float(mw),
            "mean_deg": float(np.mean([len(adj.get(t, [])) for t in nodes_list])),
            "mean_wdeg": float(np.mean([(sum(s for _, s in adj.get(t, []))/max(1,len(adj.get(t, []))))
                                        for t in nodes_list])),
            "short_rate": float(np.mean([len(t)<=3 for t in nodes_list])),
            "digit_rate": float(np.mean([any(c.isdigit() for c in t) for t in nodes_list])),
            "zz_rate": float(np.mean([t.strip().lower().startswith('zz') for t in nodes_list]))
        })

    # Merge with non-giant clusters
    map_com = map_df.copy()
    mask = map_com["cluster_id"] == giant_cid
    map_com.loc[mask, "cluster_id"] = map_com.loc[mask, "token"].map(token_to_new).astype("int64")
    # Summary
    sum_non = sum_df[sum_df["cluster_id"] != giant_cid].copy()
    sum_giant_new = pd.DataFrame(rows_sum)
    sum_com = pd.concat([sum_non, sum_giant_new], ignore_index=True)\
                 .sort_values(["size","medoid_wdeg"], ascending=[False, False])

    # Canonical map (token ‚Üí medoid)
    medoids = sum_com[["cluster_id","medoid"]].copy()
    canon = map_com[["token","cluster_id"]].merge(medoids, on="cluster_id", how="left")
    canon.rename(columns={"medoid":"canonical_label"}, inplace=True)

    # Save artifacts
    pd.DataFrame(kept, columns=["token_a","token_b","cosine_sim","shared","jaccard","weight"]).to_parquet(EDGES_USED, index=False)
    map_com.to_parquet(MAP_COM, index=False)
    sum_com.to_parquet(SUM_COM, index=False)
    canon.to_parquet(CANON_PQ, index=False)
    canon.to_csv(CANON_CSV, index=False)

    # Prints
    p("============================================================")
    p("üíæ SAVED")
    p("============================================================")
    p(f"Edges used            : {EDGES_USED}")
    p(f"Token map (community) : {MAP_COM}")
    p(f"Summary (community)   : {SUM_COM}")
    p(f"Canonical map         : {CANON_PQ}")
    p(f"Canonical CSV         : {CANON_CSV}")

    # Spot-check 200 rewrites (token != canonical)
    p("============================================================")
    p("üîç SAMPLE REWRITES (200) ‚Äî token ‚Üí canonical_label")
    p("============================================================")
    diff = canon[canon["token"] != canon["canonical_label"]]
    sample = diff.sample(min(REWRITE_SAMPLE_N, len(diff)), random_state=42) if len(diff) else canon.sample(min(REWRITE_SAMPLE_N, len(canon)), random_state=42)
    print(sample.sort_values(["cluster_id","canonical_label","token"]).to_string(index=False, max_colwidth=80))

if __name__ == "__main__":
    main()

üîé INPUTS
giant cluster_id=0 | size=213,904
giant edges file: romance-novel-nlp-research/src/eda_analysis/outputs/candidate_similarity_pairs.cluster0_refined.parquet | rows=1,135,066
Content-word sets: 216,300 | time=2.93s
üîé EDGE FILTER SUMMARY (GIANT)
Kept edges: 1,076,118 / 1,135,066  | time=6.55s
- gate_fail: 58,948
Graph nodes=215,776 | edges=538,059
üîé COMMUNITY DETECTION
Communities: 35,766 | time=77.81s
Size histogram:
1-1: 5,261
2-2: 5,672
3-4: 8,019
5-9: 9,417
10-19: 6,385
20-49: 1,012
50-99: 0
100-199: 0
200-499: 0
500-999: 0
1000-999998: 0
max=49 | median=4 | mean=6.03
üíæ SAVED
Edges used            : romance-novel-nlp-research/src/eda_analysis/outputs/giant_comm_edges_used.parquet
Token map (community) : romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.community.parquet
Summary (community)   : romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.community.parquet
Canonical map         : romance-novel-nlp-research/src/eda_analys

In [19]:
# romance-novel-nlp-research/src/eda_analysis/cell3h_canonicalize_merge_labels.py
"""
Re-score canonical labels for community clusters with quality heuristics and synthetic labels.
Outputs new canonical map + summary and prints 200 before/after rewrites for QA.
"""

from __future__ import annotations
import re, sys, math, time
from pathlib import Path
from typing import Dict, List, Tuple, Set
from collections import Counter, defaultdict

import numpy as np
import pandas as pd

# ----------------------------- Paths -----------------------------
OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
MAP_COM   = OUT / "clusters_token_map.community.parquet"
SUM_COM   = OUT / "clusters_summary.community.parquet"
EDGES_USED= OUT / "giant_comm_edges_used.parquet"   # edges from giant-cluster refinement

# Outputs (v2 canonicalization)
SUM_V2    = OUT / "clusters_summary.community.v2.parquet"
CANON_V2_PQ = OUT / "token_canonical_map.community.v2.parquet"
CANON_V2_CSV= OUT / "token_canonical_map.community.v2.csv"

# --------------------------- Config knobs ------------------------
SYNTH_TOP_K_WORDS: int = 3
SYNTH_MIN_WORDS: int = 2

# Quality scoring configuration
class QUALITY:
    # weights to balance structure + string quality
    DEG_W: float = 0.60
    MEAN_SIM_W: float = 0.40
    CW_BONUS_W: float = 0.20
    # penalties
    DIGIT_PENALTY: float = 0.70   # scaled by digit_ratio
    SHORT_PENALTY: float = 0.60   # len<4
    ZZ_PENALTY: float = 0.80      # startswith 'zz'
    NONALPHA_PENALTY: float = 0.50 # 1 - alpha_ratio

# selection rules
PREFER_SYNTH_IF_LOW_QUALITY: bool = True
LOW_QUALITY_SCORE_CUTOFF: float = 0.30
ALPHA_RATIO_MIN_FOR_TOKEN_LABEL: float = 0.65
MAX_LABEL_LEN: int = 80

# ----------------------------- Utils -----------------------------
def p(x: str) -> None:
    print(x); sys.stdout.flush()

try:
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    STOP = set(w.lower() for w in ENGLISH_STOP_WORDS)
except Exception:
    STOP = {"a","an","the","and","or","to","of","in","on","for","with","by","as","from",
            "this","that","these","those","it","its","is","are","be","was","were","been",
            "i","you","we","he","she","they","me","him","her","them","my","your","our","their",
            "not","no","yes","into","over","under","up","down","out","more","most","less","very"}

TOKEN_RE = re.compile(r"[a-z]{2,}")

def content_words(s: str) -> List[str]:
    return [w for w in TOKEN_RE.findall(s.lower()) if w not in STOP]

def digit_ratio(s: str) -> float:
    if not s: return 0.0
    d = sum(c.isdigit() for c in s); return d/len(s)

def alpha_ratio(s: str) -> float:
    if not s: return 0.0
    a = sum(c.isalpha() for c in s); return a/len(s)

def is_bad_prefix(s: str) -> bool:
    x = s.strip().lower()
    return x.startswith("zz") or x.startswith("0 ") or x in {"good","bad","ok","okay"}

def clamp(x: float, a: float, b: float) -> float:
    return max(a, min(b, x))

# ------------------------ Load artifacts -------------------------
p("============================================================")
p("üîé INPUTS")
p("============================================================")
for f in [MAP_COM, SUM_COM]:
    p(f"{f} | exists={f.exists()} | size={(f.stat().st_size/1024/1024):.2f} MB" if f.exists() else f"{f} | MISSING")
if not (MAP_COM.exists() and SUM_COM.exists()):
    raise FileNotFoundError("Missing community artifacts. Run cell3g first.")

map_df = pd.read_parquet(MAP_COM)    # token, cluster_id, possibly degree/wdegree from earlier
sum_df = pd.read_parquet(SUM_COM)    # cluster_id, size, medoid, medoid_deg, medoid_wdeg, ...

HAS_EDGES = EDGES_USED.exists()
p(f"{EDGES_USED} | exists={HAS_EDGES}")
edges = pd.read_parquet(EDGES_USED) if HAS_EDGES else None

p(f"clusters={len(sum_df):,} | tokens={len(map_df):,}")

# ----------------- Token structural stats (giant only) -----------------
# Recompute degree & mean_sim from edges for nodes present (covers giant-portion).
p("============================================================")
p("üîß BUILD STRUCTURAL STATS FROM EDGES (giant portion)")
p("============================================================")
deg = defaultdict(int)
sim_sum = defaultdict(float)

if HAS_EDGES:
    for a, b, s in zip(edges["token_a"], edges["token_b"], edges["cosine_sim"]):
        s = float(s)
        deg[a] += 1; sim_sum[a] += s
        deg[b] += 1; sim_sum[b] += s

# Merge structural stats into map_df; keep previous columns as fallback
if "degree" not in map_df.columns: map_df["degree"] = 0
if "wdegree" not in map_df.columns: map_df["wdegree"] = 0.0

map_df["_deg2"] = map_df["token"].map(lambda t: deg.get(t, 0)).astype(np.int32)
map_df["_msim2"] = map_df["token"].map(lambda t: (sim_sum.get(t, 0.0) / max(1, deg.get(t, 0)))).astype(np.float32)

# Combine: prefer recomputed if available
map_df["_deg"]  = map_df[["_deg2","degree"]].max(axis=1)
map_df["_msim"] = map_df[["_msim2","wdegree"]].max(axis=1)

# Normalize per cluster later
# --------------------- Quality score function -------------------------
def token_quality_score(token: str, deg_norm: float, mean_sim_norm: float) -> float:
    """Combine structure + string quality into a single score in [0,1+]."""
    cw = content_words(token)
    cw_bonus = QUALITY.CW_BONUS_W * clamp(len(cw)/4.0, 0.0, 1.0)  # up to +0.20 for ‚â•4 content words
    # penalties
    pen = 0.0
    pen += QUALITY.DIGIT_PENALTY * digit_ratio(token)
    pen += QUALITY.SHORT_PENALTY * (1.0 if len(token) < 4 else 0.0)
    pen += QUALITY.ZZ_PENALTY * (1.0 if is_bad_prefix(token) else 0.0)
    pen += QUALITY.NONALPHA_PENALTY * (1.0 - alpha_ratio(token))
    base = QUALITY.DEG_W * deg_norm + QUALITY.MEAN_SIM_W * mean_sim_norm
    return clamp(base + cw_bonus - pen, -1.0, 2.0)

# --------------------- Per-cluster rescoring --------------------------
p("============================================================")
p("üöÄ COMPUTE NEW CANONICAL LABELS (quality re-score + synthetic labels)")
p("============================================================")
rows_sum = []
token_rows = []

# Pre-group to speed up
grp = map_df.groupby("cluster_id", sort=False)
N = len(sum_df)

t0 = time.time()
for idx, (cid, g) in enumerate(grp, start=1):
    # Normalize deg/msim inside the cluster for fairness
    d = g["_deg"].to_numpy(dtype=np.float32, copy=False)
    m = g["_msim"].to_numpy(dtype=np.float32, copy=False)
    d_norm = (d - d.min()) / (d.max() - d.min() + 1e-9)
    m_norm = (m - m.min()) / (m.max() - m.min() + 1e-9)

    # Score tokens
    scores = []
    for tok, dn, mn in zip(g["token"].tolist(), d_norm.tolist(), m_norm.tolist()):
        scores.append((tok, token_quality_score(tok, dn, mn)))
    # Best token candidate
    best_tok, best_score = max(scores, key=lambda x: x[1])

    # Synthetic label: top content words across cluster
    cw_counts = Counter()
    for tok in g["token"].tolist():
        cw_counts.update(content_words(tok))
    synth_words = [w for w,_ in cw_counts.most_common(SYNTH_TOP_K_WORDS)]
    synth_words = [w for w in synth_words if len(w) >= 2][:SYNTH_TOP_K_WORDS]
    synth_label = " ".join(synth_words[:SYNTH_TOP_K_WORDS])[:MAX_LABEL_LEN]
    use_synth = False

    # Decide final label
    if PREFER_SYNTH_IF_LOW_QUALITY:
        # If best token looks low-quality, prefer synthetic if it has enough words
        if (best_score < LOW_QUALITY_SCORE_CUTOFF or alpha_ratio(best_tok) < ALPHA_RATIO_MIN_FOR_TOKEN_LABEL) and len(synth_words) >= SYNTH_MIN_WORDS:
            use_synth = True

    final_label = synth_label if use_synth and synth_label else best_tok
    label_source = "synthetic" if (final_label == synth_label and final_label) else "token"

    # Build summary row
    prev_row = sum_df.loc[sum_df["cluster_id"] == cid]
    prev_medoid = prev_row["medoid"].iloc[0] if len(prev_row) else None
    rows_sum.append({
        "cluster_id": int(cid),
        "size": int(len(g)),
        "prev_medoid": prev_medoid,
        "new_label": final_label,
        "label_source": label_source,
        "best_token_candidate": best_tok,
        "best_token_score": float(best_score),
        "uniq_cw": int(len(cw_counts)),
    })

    # Assign to all tokens in cluster
    tok_df = g[["token","cluster_id"]].copy()
    tok_df["canonical_label"] = final_label
    token_rows.append(tok_df)

    if idx % 1000 == 0:
        p(f"[{idx}/{N}] clusters processed...")

elapsed = time.time() - t0
p(f"Done. Clusters processed: {N:,} | time={elapsed:.2f}s")

sum_v2 = pd.DataFrame(rows_sum).sort_values(["size","new_label"], ascending=[False, True])
canon_v2 = pd.concat(token_rows, ignore_index=True)

# ------------------------ Save artifacts -------------------------
p("============================================================")
p("üíæ SAVING (v2 canonicalization)")
p("============================================================")
sum_v2.to_parquet(SUM_V2, index=False)
canon_v2.to_parquet(CANON_V2_PQ, index=False)
canon_v2.to_csv(CANON_V2_CSV, index=False)
p(f"‚úÖ Summary (v2): {SUM_V2}")
p(f"‚úÖ Canon map v2 (parquet): {CANON_V2_PQ}")
p(f"‚úÖ Canon map v2 (csv):     {CANON_V2_CSV}")

# ------------------------ Prints & QA ----------------------------
p("============================================================")
p("üìà SUMMARY")
p("============================================================")
n_clusters = sum_v2["cluster_id"].nunique()
n_tokens = canon_v2["token"].nunique()
n_labels = sum_v2["new_label"].nunique()
p(f"Clusters: {n_clusters:,} | Tokens: {n_tokens:,} | Unique canonical labels: {n_labels:,}")
p("Top 15 largest clusters (new labels):")
p(sum_v2.head(15)[["cluster_id","size","new_label","label_source","best_token_candidate","best_token_score"]].to_string(index=False, max_colwidth=60))

# Compare with previous canonical map (if exists)
prev_map_pq = OUT / "token_canonical_map.community.parquet"
if prev_map_pq.exists():
    prev = pd.read_parquet(prev_map_pq)[["token","cluster_id","canonical_label"]].rename(columns={"canonical_label":"old_label"})
    merged = canon_v2.merge(prev, on=["token","cluster_id"], how="left")
    changed = merged[merged["canonical_label"] != merged["old_label"]]
    p(f"\nChanged labels: {len(changed):,} / {len(merged):,} ({len(changed)/max(1,len(merged)):.1%})")

    # Show 200 random before/after rewrites
    p("\n============================================================")
    p("üîç SAMPLE REWRITES (200) ‚Äî token ‚Üí old_label  //  new_label")
    p("============================================================")
    sample_n = min(200, len(changed)) if len(changed) else min(200, len(merged))
    samp = (changed if len(changed) else merged).sample(sample_n, random_state=42)
    out = samp[["token","old_label","canonical_label","cluster_id"]].rename(columns={"canonical_label":"new_label"})
    out = out.sort_values(["cluster_id","new_label","token"]).reset_index(drop=True)
    print(out.to_string(index=False, max_colwidth=80))
else:
    # No previous map; show straight 200 rewrites (token -> label)
    p("\n============================================================")
    p("üîç SAMPLE REWRITES (200) ‚Äî token ‚Üí new_label")
    p("============================================================")
    sample_n = min(200, len(canon_v2))
    samp = canon_v2.sample(sample_n, random_state=42).sort_values(["cluster_id","canonical_label","token"])
    print(samp.to_string(index=False, max_colwidth=80))

p("\n============================================================")
p("NEXT ACTION HINTS")
p("============================================================")
p("- If labels still look noisy, raise penalties (DIGIT_PENALTY, ZZ_PENALTY) or require ALPHA_RATIO_MIN_FOR_TOKEN_LABEL‚âà0.75.")
p("- Increase SYNTH_MIN_WORDS to 3 to force more descriptive synthetic labels on tiny communities.")
p("- Optional next: merge communities by new-label embeddings (thr‚âà0.85) to reduce 35k ‚Üí fewer macro-topics.")

üîé INPUTS
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.community.parquet | exists=True | size=4.68 MB
romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.community.parquet | exists=True | size=1.48 MB
romance-novel-nlp-research/src/eda_analysis/outputs/giant_comm_edges_used.parquet | exists=True
clusters=42,784 | tokens=232,918
üîß BUILD STRUCTURAL STATS FROM EDGES (giant portion)
üöÄ COMPUTE NEW CANONICAL LABELS (quality re-score + synthetic labels)
[1000/42784] clusters processed...
[2000/42784] clusters processed...
[3000/42784] clusters processed...
[4000/42784] clusters processed...
[5000/42784] clusters processed...
[6000/42784] clusters processed...
[7000/42784] clusters processed...
[8000/42784] clusters processed...
[9000/42784] clusters processed...
[10000/42784] clusters processed...
[11000/42784] clusters processed...
[12000/42784] clusters processed...
[13000/42784] clusters processed...
[14000/42784] clusters processed...
[15

In [20]:
# romance-novel-nlp-research/src/eda_analysis/cell3i_year_blind_merge_labels.py
"""
Year-blind community merge:
- Merge communities whose labels become equal after normalization.
- Keep numeric semantics when digits are meaningful (pages, ordinals, "top 100", etc).
- Rebuild token‚Üícanonical map; print before/after stats and 200 rewrites.

Why: unify ephemeral-year shelves ("to read 2013/2014") without wrecking numeric topics.
"""

from __future__ import annotations
import re, sys, time, hashlib
from pathlib import Path
from typing import Dict, List, Tuple
from collections import defaultdict, Counter

import numpy as np
import pandas as pd

OUT = Path("romance-novel-nlp-research/src/eda_analysis/outputs")
SUM_V2    = OUT / "clusters_summary.community.v2.parquet"
CANON_V2  = OUT / "token_canonical_map.community.v2.parquet"
MAP_COMM  = OUT / "clusters_token_map.community.parquet"  # for tokens‚Üíold cluster ids

# Outputs
MAP_V3    = OUT / "clusters_token_map.community.v3.parquet"
SUM_V3    = OUT / "clusters_summary.community.v3.parquet"
CANON_V3P = OUT / "token_canonical_map.community.v3.parquet"
CANON_V3C = OUT / "token_canonical_map.community.v3.csv"

# ----------------------- Config knobs -----------------------
NORMALIZE_MODE = "year_blind"     # "year_blind" | "strip_all_digits"
KEEP_NUMERIC_TOPICS = True        # don't strip digits if label mentions pages/pgs/ordinals/top lists, etc.
SYNTH_LABEL_TOP_K = 3             # top content words to form synthetic label
REWRITE_SAMPLE_N = 200

YEAR_RE   = re.compile(r"\b(?:19|20)\d{2}\b")
DIGIT_RE  = re.compile(r"\d+")
TOKEN_RE  = re.compile(r"[a-z]{2,}")
SPC_RE    = re.compile(r"\s{2,}")

# patterns where numbers matter -> preserve
MEANINGFUL_NUM_PATTERNS = [
    r"\bpage(s)?\b", r"\bpgs?\b", r"\bword(s)?\b", r"\bchapter(s)?\b",
    r"\bbook(s)?\b\s*\d+", r"\bvol(ume)?\b\s*\d+", r"\bpart\s*\d+",
    r"\btop\s*\d+\b", r"\b\d+\s*(to|-)\s*\d+\b",  # ranges
    r"\b\d{1,3}%\b", r"\bstar(s)?\b", r"\bseries\s*\d+\b"
]
MEANINGFUL_NUM_RE = re.compile("|".join(MEANINGFUL_NUM_PATTERNS), re.IGNORECASE)

def p(x: str) -> None:
    print(x); sys.stdout.flush()

def content_words(s: str) -> List[str]:
    return TOKEN_RE.findall(s.lower())

def normalize_label(text: str, mode: str) -> str:
    """Remove only years by default; optional full digit strip with guards."""
    x = text.strip()
    # If digits are meaningful, return early when KEEP_NUMERIC_TOPICS
    if KEEP_NUMERIC_TOPICS and MEANINGFUL_NUM_RE.search(x or ""):
        y = YEAR_RE.sub("", x)
        y = SPC_RE.sub(" ", y).strip()
        return y.lower()
    if mode == "year_blind":
        y = YEAR_RE.sub("", x)
    elif mode == "strip_all_digits":
        y = DIGIT_RE.sub("", x)
    else:
        y = x
    y = SPC_RE.sub(" ", y).strip()
    return y.lower()

def key_from_label(text: str, mode: str) -> str:
    """Key is sorted content-words after normalization; stable for exact grouping."""
    norm = normalize_label(text, mode)
    toks = [t for t in content_words(norm) if t]  # already lower
    if not toks:
        return ""  # avoid merging empties
    toks = sorted(set(toks))
    return " ".join(toks)

def synth_label_from_key(key: str, k: int) -> str:
    if not key: return ""
    toks = key.split()
    # prefer frequent-ish words first: here key is set; keep lexicographic for stability
    return " ".join(toks[:k])

def stable_new_cluster_id(key: str) -> int:
    """Deterministic 64-bit id from key (avoid collisions across runs)."""
    h = hashlib.blake2b(key.encode("utf-8"), digest_size=8).hexdigest()
    return int(h, 16)

def main() -> None:
    # ---------- Load ----------
    for f in (SUM_V2, CANON_V2, MAP_COMM):
        if not f.exists():
            raise FileNotFoundError(f"Missing input: {f}")
    sum_v2   = pd.read_parquet(SUM_V2)      # cluster_id, size, new_label, ...
    canon_v2 = pd.read_parquet(CANON_V2)    # token, cluster_id, canonical_label
    map_comm = pd.read_parquet(MAP_COMM)    # token, cluster_id (pre-v2, but ids match v2 clusters)

    # ---------- Prep ----------
    p("============================================================")
    p("üîé INPUTS")
    p("============================================================")
    p(f"clusters={sum_v2['cluster_id'].nunique():,} | tokens={canon_v2['token'].nunique():,}")
    p(f"normalize_mode={NORMALIZE_MODE} | keep_numeric_topics={KEEP_NUMERIC_TOPICS}")

    # Build merge keys from v2 labels
    t0 = time.time()
    sum_v2 = sum_v2.copy()
    sum_v2["merge_key"] = sum_v2["new_label"].astype(str).map(lambda s: key_from_label(s, NORMALIZE_MODE))
    # Guard: don't merge empty keys
    sum_v2["merge_key"] = sum_v2["merge_key"].fillna("").astype(str)
    empty_keys = int((sum_v2["merge_key"] == "").sum())
    if empty_keys:
        p(f"‚ö†Ô∏è Empty merge keys: {empty_keys:,} (will preserve as separate clusters)")
    # Group clusters by key
    grp = sum_v2.groupby("merge_key", dropna=False)
    groups = {k: g["cluster_id"].tolist() for k, g in grp}
    # Stats
    multi = {k:v for k,v in groups.items() if k and len(v) > 1}
    p(f"Candidate groups: {len(groups):,} | multi-merge groups: {len(multi):,} | time={time.time()-t0:.2f}s")

    # ---------- Build old‚Üínew cluster id map ----------
    old_to_new: Dict[int, int] = {}
    key_to_label: Dict[str, str] = {}
    rows_sum = []

    for key, cids in groups.items():
        # new id: stable per key; for empty key, keep original ids
        if not key or len(cids) == 1:
            # identity mapping for singles/empties
            for cid in cids:
                old_to_new[int(cid)] = int(cid)
            # label = existing
            lab = sum_v2.loc[sum_v2["cluster_id"] == cids[0], "new_label"].iloc[0]
            key_to_label[key] = lab
            continue

        new_cid = stable_new_cluster_id(key)
        for cid in cids:
            old_to_new[int(cid)] = int(new_cid)
        # synthetic label from key
        synth = synth_label_from_key(key, SYNTH_LABEL_TOP_K)
        # fallback: most frequent existing label in the group
        if not synth:
            counts = Counter(sum_v2.loc[sum_v2["cluster_id"].isin(cids), "new_label"].astype(str))
            synth = counts.most_common(1)[0][0]
        key_to_label[key] = synth

    # ---------- Reassign token map ----------
    t1 = time.time()
    map_v3 = map_comm[["token","cluster_id"]].copy()
    map_v3["cluster_id_old"] = map_v3["cluster_id"].astype("int64")
    map_v3["cluster_id"] = map_v3["cluster_id_old"].map(lambda c: old_to_new.get(int(c), int(c))).astype("int64")

    # ---------- Build v3 summary ----------
    # Size per new cluster
    sz = map_v3.groupby("cluster_id").size().rename("size").reset_index()
    # Derive label per new cluster
    # Map new cluster_id back to key to get label
    # Build key‚Üínew_id (skip empties handled by identity)
    key_to_newid = {k: stable_new_cluster_id(k) for k in groups.keys() if k and len(groups[k]) > 1}
    # For clusters not merged: keep original label from v2
    # First, build cid‚Üílabel from v2 (new_label)
    cid2label_v2 = dict(zip(sum_v2["cluster_id"].astype("int64"), sum_v2["new_label"].astype(str)))
    labels = []
    for cid in sz["cluster_id"].astype("int64"):
        # if this cid is one of the merged new ids, find its key label
        key = None
        for k, nid in key_to_newid.items():
            if nid == cid:
                key = k; break
        if key is not None:
            labels.append(key_to_label.get(key, synth_label_from_key(key, SYNTH_LABEL_TOP_K)))
        else:
            labels.append(cid2label_v2.get(int(cid), ""))

    sum_v3 = sz.copy()
    sum_v3["label"] = labels

    # ---------- Build canonical map v3 ----------
    # token -> label via cluster_id
    cid2label_v3 = dict(zip(sum_v3["cluster_id"].astype("int64"), sum_v3["label"].astype(str)))
    canon_v3 = map_v3[["token","cluster_id"]].copy()
    canon_v3["canonical_label"] = canon_v3["cluster_id"].map(lambda c: cid2label_v3.get(int(c), ""))

    # ---------- Save ----------
    map_v3.drop(columns=["cluster_id_old"], inplace=True)
    map_v3.to_parquet(MAP_V3, index=False)
    sum_v3.to_parquet(SUM_V3, index=False)
    canon_v3.to_parquet(CANON_V3P, index=False)
    canon_v3.to_csv(CANON_V3C, index=False)

    # ---------- Prints ----------
    p("============================================================")
    p("üíæ SAVED")
    p("============================================================")
    p(f"Token map (v3): {MAP_V3}")
    p(f"Summary   (v3): {SUM_V3}")
    p(f"Canon map (v3): {CANON_V3P}")
    p(f"Canon csv (v3): {CANON_V3C}")

    # Before/after stats
    n_before = sum_v2["cluster_id"].nunique()
    n_after  = sum_v3["cluster_id"].nunique()
    merged_groups = sum(1 for v in groups.values() if len(v) > 1)
    total_merged_clusters = sum(len(v)-1 for v in groups.values() if len(v) > 1)
    p("============================================================")
    p("üìà MERGE SUMMARY")
    p("============================================================")
    p(f"Before communities : {n_before:,}")
    p(f"After communities  : {n_after:,}")
    p(f"Merging groups     : {merged_groups:,}")
    p(f"Clusters collapsed : {total_merged_clusters:,}")

    # Show top 20 merged keys
    merged_key_sizes = [(k, len(v)) for k,v in groups.items() if k and len(v) > 1]
    merged_key_sizes.sort(key=lambda x: x[1], reverse=True)
    p("\nTop 20 merge keys (size, key -> sample labels):")
    for k, szk in merged_key_sizes[:20]:
        labs = sum_v2.loc[sum_v2["merge_key"] == k, "new_label"].head(5).tolist()
        p(f"- {szk:>3} | {k} -> {labs}")

    # Year-only examples
    p("\nExamples merged by year-blind (up to 30):")
    examples = []
    for k, cids in groups.items():
        if not k or len(cids) <= 1: continue
        labs = sum_v2.loc[sum_v2["cluster_id"].isin(cids), "new_label"].astype(str).tolist()
        if any(YEAR_RE.search(l) for l in labs):
            examples.append((k, labs[:6]))
        if len(examples) >= 30: break
    for k, labs in examples:
        p(f"* {k}  ::  {labs}")

    # 200 rewrites: token ‚Üí old_label // new_label
    p("\n============================================================")
    p("üîç SAMPLE REWRITES (200) ‚Äî token ‚Üí old_label  //  new_label")
    p("============================================================")
    old = pd.read_parquet(CANON_V2)[["token","cluster_id","canonical_label"]].rename(columns={"canonical_label":"old_label"})
    new = canon_v3[["token","cluster_id","canonical_label"]].rename(columns={"canonical_label":"new_label"})
    merged = old.merge(new, on=["token","cluster_id"], how="outer", indicator=True).fillna({"old_label":"","new_label":""})
    changed = merged[merged["old_label"] != merged["new_label"]]
    samp = (changed if len(changed) else merged).sample(min(REWRITE_SAMPLE_N, len(merged)), random_state=42)
    out = samp[["token","old_label","new_label","cluster_id"]].sort_values(["cluster_id","new_label","token"])
    print(out.to_string(index=False, max_colwidth=80))

if __name__ == "__main__":
    main()

üîé INPUTS
clusters=35,115 | tokens=232,918
normalize_mode=year_blind | keep_numeric_topics=True
‚ö†Ô∏è Empty merge keys: 154 (will preserve as separate clusters)
Candidate groups: 33,746 | multi-merge groups: 637 | time=4.20s
üíæ SAVED
Token map (v3): romance-novel-nlp-research/src/eda_analysis/outputs/clusters_token_map.community.v3.parquet
Summary   (v3): romance-novel-nlp-research/src/eda_analysis/outputs/clusters_summary.community.v3.parquet
Canon map (v3): romance-novel-nlp-research/src/eda_analysis/outputs/token_canonical_map.community.v3.parquet
Canon csv (v3): romance-novel-nlp-research/src/eda_analysis/outputs/token_canonical_map.community.v3.csv
üìà MERGE SUMMARY
Before communities : 35,115
After communities  : 33,899
Merging groups     : 638
Clusters collapsed : 1,369

Top 20 merge keys (size, key -> sample labels):
-  26 | read -> ['read m', 'read c', 'read read', '2 read read', 'g read']
-  23 | series -> ['series13', 'series g', 'b a d series', 'b series', 's a s s se