# Topic Quality EDA and Noisy Topic Detection

This notebook performs exploratory data analysis (EDA) on BERTopic model topics and flags candidate noisy topics for manual inspection before proceeding with LLM-generated topic labeling.

## Overview
- **Input**: Retrained BERTopic model (wrapper or native)
- **Analysis**: Topic size, POS representation stats, per-topic POS coherence
- **Output**: Topic quality table with noise candidate flags and inspection labels
- **Purpose**: Identify noisy topics before LLM labeling stage

## Key Features
- Keeps the model as-is (no topics removed)
- EDA on POS representation (size, POS length, per-topic POS coherence)
- Flags candidate noisy topics with label + reason for manual inspection
- Uses the same loading & batching logic as `explore_retrained_model.py`

## Model Loading Strategy

**Default: Wrapper (`model_*.pkl`)** - Recommended for EDA
- ✅ Contains exact training dataset (`wrapper.dataset_as_list_of_strings`)
- ✅ Guaranteed to use same text the model was trained on
- ✅ No mismatch between training corpus and EDA corpus
- ✅ Works seamlessly with `prepare_documents(wrapper, ...)`

**Alternative: Native BERTopic (`model_{rank}/` directory)**
- Use `USE_NATIVE = True` if you prefer standard BERTopic serialization
- Requires providing `DATASET_CSV` separately
- Better for sharing and deployment, but you must make sure the dataset matches the training data
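
When the native model is loaded with a separate `DATASET_CSV`, one way to check that the corpus matches the training data is to compare a fingerprint of both document lists. The following is a minimal sketch using only the standard library; `corpus_fingerprint` is a hypothetical helper, not part of the project code, and it assumes both corpora are available as ordered sequences of strings.

```python
import hashlib
from typing import Iterable, Tuple


def corpus_fingerprint(docs: Iterable[str]) -> Tuple[int, str]:
    """Return (document count, SHA-256 hex digest) for an ordered corpus.

    Hypothetical helper: two corpora with the same fingerprint contain the
    same documents in the same order.
    """
    digest = hashlib.sha256()
    count = 0
    for doc in docs:
        data = doc.encode("utf-8")
        # Length prefix keeps document boundaries unambiguous,
        # so ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
        count += 1
    return count, digest.hexdigest()


# Toy example: compare a "training" corpus with a candidate EDA corpus.
training_docs = ["first chapter text", "second chapter text"]
eda_docs = ["first chapter text", "second chapter text"]
print(corpus_fingerprint(training_docs) == corpus_fingerprint(eda_docs))
```

In this notebook the training side would presumably come from `wrapper.dataset_as_list_of_strings`, which requires the wrapper pickle to be available at least once to record the fingerprint.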


## Setup Instructions

**Important**: Always run this notebook with the project's virtual environment activated.

```bash
# Activate venv before launching Jupyter
source venv/bin/activate  # On Linux/Mac
# or
venv\Scripts\activate  # On Windows

# Then launch Jupyter
jupyter notebook notebooks/06_labeling/topic_quality_eda.ipynb
```

## Cell 1: Verify venv, imports, and logging setup

In [1]:
# Cell 1: Verify venv, imports, and logging setup

import sys
from pathlib import Path

# Verify we're using the venv Python
venv_path = Path.cwd().parent.parent / "venv"
if venv_path.exists():
    expected_prefix = str(venv_path.resolve())
    if expected_prefix not in sys.prefix:
        print("⚠️  WARNING: Not using venv!")
        print(f"   Current Python: {sys.executable}")
        print(f"   Expected venv: {venv_path}")
        print("   Please activate venv: source venv/bin/activate")
    else:
        print(f"✓ Using venv: {sys.executable}")
else:
    print(f"⚠️  WARNING: venv directory not found at {venv_path}")

import logging
import pickle

import pandas as pd
from gensim.models import CoherenceModel

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

from src.stage06_exploration.explore_retrained_model import (
    LOGGER,
    DEFAULT_BASE_DIR,
    DEFAULT_EMBEDDING_MODEL,
    DEFAULT_BATCH_SIZE,
    DEFAULT_CHAPTERS_CSV,
    DEFAULT_CHAPTERS_SUBSET_CSV,
    DEFAULT_CORPUS_PATH,
    load_retrained_wrapper,
    load_native_bertopic_model,
    prepare_documents,
    load_dictionary_from_corpus,
    extract_all_topics,
    backup_existing_file,
    stage_timer,
)

from src.stage06_eda.topic_quality_analysis import (
    build_topic_quality_table,
    apply_noise_labels_to_model,
)

# Ensure INFO-level logs from our logger
LOGGER.setLevel(logging.INFO)
print("Logger name:", LOGGER.name)
print(f"Project root: {project_root}")


✓ Using venv: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/venv/bin/python


  from .autonotebook import tqdm as notebook_tqdm


[2025-12-05 16:27:54.798] [CUML] [info] build_algo set to brute_force_knn because random_state is given
✅ RAPIDS (cuML) is available and functional
Logger name: stage06_topics_exploration
Project root: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor
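
The check in Cell 1 depends on the notebook's working directory being two levels below the project root. As a complementary sketch, the standard library can report whether *any* virtual environment is active by comparing `sys.prefix` with `sys.base_prefix`. The function names below are hypothetical, and the check does not confirm that the active environment is this project's `venv`.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_interpreter() -> str:
    """Build a one-line status message for the current interpreter."""
    status = "venv active" if in_virtualenv() else "no venv detected"
    return f"{status}: {sys.executable}"


print(describe_interpreter())
```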


## Cell 2: Configuration (paths, thresholds, etc.)


In [2]:
# Cell 2: Configuration
# 
# This configuration matches stage06_exploration defaults:
# - Uses the saved model at models/retrained/paraphrase-MiniLM-L6-v2/model_1.pkl
# - Loads the same dataset that was used for training (from wrapper cache)
# - Uses the same dictionary corpus as stage06_exploration

# Model configuration - matches stage06_exploration defaults
BASE_DIR = DEFAULT_BASE_DIR   # models/retrained
EMBEDDING_MODEL = DEFAULT_EMBEDDING_MODEL  # "paraphrase-MiniLM-L6-v2"
PARETO_RANK = 1

# Model loading strategy:
# USE_NATIVE = False (default): Load wrapper pickle - RECOMMENDED for EDA
#   - Contains exact training dataset (wrapper.dataset_as_list_of_strings)
#   - Guaranteed to use same text the model was trained on
#   - No mismatch between training corpus and EDA corpus
# USE_NATIVE = True: Load native BERTopic directory
#   - Standard BERTopic serialization (more portable)
#   - Requires providing DATASET_CSV separately
#   - Better for sharing/deployment, but need to ensure dataset matches training data
USE_NATIVE = False  # False => load wrapper pickle (contains training dataset)

# Dataset configuration
# IMPORTANT: USE_WRAPPER_DOCS_ONLY = True ensures we use the exact same dataset
# that was used for training (stored in the wrapper's dataset_as_list_of_strings)
# 
# If USE_NATIVE = True, set USE_WRAPPER_DOCS_ONLY = False and provide DATASET_CSV
USE_WRAPPER_DOCS_ONLY = True     # True => use docs from wrapper (training dataset)
FALLBACK_DATASET = "chapters"    # "chapters" or "subset" (only used if wrapper unavailable)
DATASET_CSV = None               # explicit Path if you want to override (required if USE_NATIVE=True)

if DATASET_CSV is None and not USE_WRAPPER_DOCS_ONLY:
    if FALLBACK_DATASET == "subset":
        DATASET_CSV = DEFAULT_CHAPTERS_SUBSET_CSV
    else:
        DATASET_CSV = DEFAULT_CHAPTERS_CSV

# Dictionary / corpus configuration
# This is the OCTIS corpus TSV used in Stage 06 (same as stage06_exploration)
DICTIONARY_CORPUS_PATH = DEFAULT_CORPUS_PATH
# OPTIONAL: if you have a previously saved Dictionary, put its path here
DICTIONARY_PICKLE_PATH = None  # e.g. Path("results/dictionary.pkl")

BATCH_SIZE = DEFAULT_BATCH_SIZE  # 50,000 (same as stage06_exploration)
LIMIT_DOCS = None  # e.g. 100_000 for experiments, or None to use all

# Topic extraction / EDA configuration
TOP_K = 10  # number of top words per topic to consider
MIN_TOPIC_SIZE = 30         # docs per topic
MIN_POS_WORDS = 3           # POS words per topic
MIN_POS_COHERENCE = 0.0     # per-topic POS coherence threshold

# Output paths
OUTPUT_DIR = project_root / "results" / "stage06_eda"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Will be used for filenames
MODEL_NAME_SAFE = EMBEDDING_MODEL.replace("/", "_").replace("\\", "_")

# Verify model file exists
model_pickle_path = BASE_DIR / EMBEDDING_MODEL / f"model_{PARETO_RANK}.pkl"
print("=" * 80)
print("Configuration Summary")
print("=" * 80)
print(f"BASE_DIR: {BASE_DIR}")
print(f"EMBEDDING_MODEL: {EMBEDDING_MODEL}")
print(f"PARETO_RANK: {PARETO_RANK}")
print(f"Model pickle path: {model_pickle_path}")
print(f"Model exists: {model_pickle_path.exists()}")
print(f"Using native BERTopic: {USE_NATIVE} (False = load wrapper with training dataset)")
print(f"USE_WRAPPER_DOCS_ONLY: {USE_WRAPPER_DOCS_ONLY} (True = use training dataset from wrapper)")
print(f"Dataset CSV (fallback): {DATASET_CSV}")
print(f"Dictionary corpus path: {DICTIONARY_CORPUS_PATH}")
print(f"Dictionary pickle path: {DICTIONARY_PICKLE_PATH}")
print(f"Output dir: {OUTPUT_DIR}")
print("=" * 80)


Configuration Summary
BASE_DIR: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained
EMBEDDING_MODEL: paraphrase-MiniLM-L6-v2
PARETO_RANK: 1
Model pickle path: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1.pkl
Model exists: True
Using native BERTopic: False (False = load wrapper with training dataset)
USE_WRAPPER_DOCS_ONLY: True (True = use training dataset from wrapper)
Dataset CSV (fallback): None
Dictionary corpus path: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/data/interim/octis/corpus.tsv
Dictionary pickle path: None
Output dir: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/results/stage06_eda


## Cell 3: Load model (wrapper / native) with diagnostics


In [3]:
# Cell 3: Load retrained BERTopic model and metadata

from src.stage06_exploration.explore_retrained_model import load_metadata
import json

# Load metadata first to see model info
metadata = load_metadata(
    base_dir=BASE_DIR,
    embedding_model=EMBEDDING_MODEL,
    pareto_rank=PARETO_RANK,
)

if metadata:
    print("=" * 80)
    print("Model Metadata")
    print("=" * 80)
    print(json.dumps(metadata, indent=2))
    print("=" * 80)
else:
    print("⚠️  No metadata file found (this is okay, model will still load)")

wrapper = None

with stage_timer("Loading retrained model"):
    if USE_NATIVE:
        topic_model = load_native_bertopic_model(
            base_dir=BASE_DIR,
            embedding_model=EMBEDDING_MODEL,
            pareto_rank=PARETO_RANK,
        )
    else:
        wrapper, topic_model = load_retrained_wrapper(
            base_dir=BASE_DIR,
            embedding_model=EMBEDDING_MODEL,
            pareto_rank=PARETO_RANK,
        )

print("\n" + "=" * 80)
print("Model Loading Summary")
print("=" * 80)
print(f"Wrapper loaded: {wrapper is not None}")
if wrapper:
    print(f"✓ Using training dataset from wrapper (exact same as used for training)")
    print(f"  Wrapper has dataset_as_list_of_strings: {hasattr(wrapper, 'dataset_as_list_of_strings')}")
    if hasattr(wrapper, 'dataset_as_list_of_strings'):
        print(f"  Number of documents in wrapper: {len(wrapper.dataset_as_list_of_strings)}")
print(f"topic_model type: {type(topic_model)}")
print("=" * 80)

# Diagnostics similar to explore_retrained_model.main()
LOGGER.info("=== Model Diagnostics (Loaded State) ===")
if hasattr(topic_model, "topic_representations_") and topic_model.topic_representations_:
    topic_ids = [tid for tid in topic_model.topic_representations_.keys() if tid != -1]
    LOGGER.info("Topic IDs in topic_representations_: %s", sorted(topic_ids)[:20])
    LOGGER.info("Total topics (excluding -1): %d", len(topic_ids))

if hasattr(topic_model, "get_topic_info"):
    try:
        topic_info = topic_model.get_topic_info()
        LOGGER.info("Topic info shape: %s", getattr(topic_info, "shape", "N/A"))
        LOGGER.info("Topic info columns: %s", getattr(topic_info, "columns", "N/A"))
    except Exception as e:
        LOGGER.warning("Could not get topic_info: %s", e)


[2025-12-05 16:27:55,392] [INFO] ▶ Loading retrained model | start
[2025-12-05 16:27:55,393] [INFO] ▶ Loading pickle wrapper: model_1.pkl | start


Model Metadata
{
  "embedding_model": "paraphrase-MiniLM-L6-v2",
  "pareto_rank": 1,
  "hyperparameters": {
    "bertopic__min_topic_size": 64,
    "bertopic__top_n_words": 27,
    "hdbscan__min_cluster_size": 143,
    "hdbscan__min_samples": 32,
    "umap__min_dist": 0.08570171053,
    "umap__n_components": 9,
    "umap__n_neighbors": 44,
    "vectorizer__min_df": 0.005931605066
  },
  "coherence": 0.4252370503,
  "topic_diversity": 0.94,
  "combined_score": 1.6505950772167353,
  "iteration": 19,
  "num_topics": 9936,
  "training_timestamp": "2025-11-29T18:00:22.427871"
}


[2025-12-05 16:28:05,822] [INFO] ■ Loading pickle wrapper: model_1.pkl | completed in 10.43 s
[2025-12-05 16:28:05,824] [INFO] ■ Loading retrained model | completed in 10.43 s
[2025-12-05 16:28:05,825] [INFO] === Model Diagnostics (Loaded State) ===
[2025-12-05 16:28:05,826] [INFO] Topic IDs in topic_representations_: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[2025-12-05 16:28:05,827] [INFO] Total topics (excluding -1): 368
[2025-12-05 16:28:05,881] [INFO] Topic info shape: (369, 8)
[2025-12-05 16:28:05,882] [INFO] Topic info columns: Index(['Topic', 'Count', 'Name', 'Representation', 'KeyBERT', 'MMR', 'POS',
       'Representative_Docs'],
      dtype='object')



Model Loading Summary
Wrapper loaded: True
✓ Using training dataset from wrapper (exact same as used for training)
  Wrapper has dataset_as_list_of_strings: True
  Number of documents in wrapper: 680822
topic_model type: <class 'bertopic._bertopic.BERTopic'>


## Cell 4: Load documents (batched, same as Stage 06)


In [4]:
# Cell 4: Load documents & tokens in batches (wrapper or CSV)
# 
# IMPORTANT: If wrapper is loaded, this uses the EXACT same dataset that was used for training.
# The wrapper stores the training dataset in dataset_as_list_of_strings.

if USE_WRAPPER_DOCS_ONLY and wrapper is None:
    raise ValueError("USE_WRAPPER_DOCS_ONLY=True but wrapper is None. Set USE_WRAPPER_DOCS_ONLY=False or load wrapper.")

dataset_csv = None
if wrapper is None:
    dataset_csv = DATASET_CSV
    if dataset_csv is None:
        raise ValueError("No wrapper and no DATASET_CSV provided.")

print("=" * 80)
print("Loading Documents")
print("=" * 80)
if wrapper and USE_WRAPPER_DOCS_ONLY:
    print("✓ Loading from wrapper (training dataset)")
else:
    print(f"⚠️  Loading from CSV: {dataset_csv}")

docs, docs_tokens = prepare_documents(
    wrapper if USE_WRAPPER_DOCS_ONLY else None,
    dataset_csv=dataset_csv,
    batch_size=BATCH_SIZE,
    limit=LIMIT_DOCS,
)

print("=" * 80)
print(f"✓ Loaded documents: {len(docs):,}")
print(f"✓ Loaded tokens:    {len(docs_tokens):,}")
if LIMIT_DOCS:
    print(f"  (Limited to {LIMIT_DOCS:,} documents)")
print("=" * 80)


[2025-12-05 16:28:05,897] [INFO] Wrapper docs in-memory | batch 1 => rows 1-50000 / 680822
[2025-12-05 16:28:05,899] [INFO] Wrapper docs in-memory | batch 2 => rows 50001-100000 / 680822
[2025-12-05 16:28:05,900] [INFO] Wrapper docs in-memory | batch 3 => rows 100001-150000 / 680822
[2025-12-05 16:28:05,901] [INFO] Wrapper docs in-memory | batch 4 => rows 150001-200000 / 680822
[2025-12-05 16:28:05,902] [INFO] Wrapper docs in-memory | batch 5 => rows 200001-250000 / 680822
[2025-12-05 16:28:05,903] [INFO] Wrapper docs in-memory | batch 6 => rows 250001-300000 / 680822
[2025-12-05 16:28:05,903] [INFO] Wrapper docs in-memory | batch 7 => rows 300001-350000 / 680822
[2025-12-05 16:28:05,904] [INFO] Wrapper docs in-memory | batch 8 => rows 350001-400000 / 680822
[2025-12-05 16:28:05,905] [INFO] Wrapper docs in-memory | batch 9 => rows 400001-450000 / 680822
[2025-12-05 16:28:05,905] [INFO] Wrapper docs in-memory | batch 10 => rows 450001-500000 / 680822
[2025-12-05 16:28:05,906] [INFO] Wra

Loading Documents
✓ Loading from wrapper (training dataset)
✓ Loaded documents: 680,822
✓ Loaded tokens:    680,822
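
The log lines above report 1-based row ranges per batch. The sketch below illustrates that batching pattern with a plain generator; `iter_batches` is a hypothetical name and is not the implementation inside `prepare_documents`, which may differ.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(
    docs: Sequence[str], batch_size: int
) -> Iterator[Tuple[int, int, int, List[str]]]:
    """Yield (batch number, first row, last row, documents) for each batch.

    Row numbers are 1-based and inclusive, mirroring the log format
    'batch 2 => rows 50001-100000'.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for batch_no, start in enumerate(range(0, len(docs), batch_size), start=1):
        chunk = list(docs[start:start + batch_size])
        yield batch_no, start + 1, start + len(chunk), chunk


# Toy example with 7 documents and a batch size of 3.
toy_docs = [f"doc {i}" for i in range(7)]
for batch_no, first, last, chunk in iter_batches(toy_docs, 3):
    print(f"batch {batch_no} => rows {first}-{last} / {len(toy_docs)}")
```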


## Cell 5: Load or build dictionary (with optional reuse)


In [5]:
# Cell 5: Load or build Gensim dictionary (streaming/batched)

from gensim.corpora import Dictionary

if DICTIONARY_PICKLE_PATH is not None and Path(DICTIONARY_PICKLE_PATH).is_file():
    with stage_timer(f"Loading Dictionary from pickle: {Path(DICTIONARY_PICKLE_PATH).name}"):
        with open(DICTIONARY_PICKLE_PATH, "rb") as f:
            dictionary: Dictionary = pickle.load(f)
        LOGGER.info("Loaded dictionary with %d tokens", len(dictionary))
else:
    dictionary = load_dictionary_from_corpus(
        corpus_path=DICTIONARY_CORPUS_PATH,
        batch_size=BATCH_SIZE,
    )
    # Optional: save for future runs
    # DICTIONARY_PICKLE_PATH = OUTPUT_DIR / f"dictionary_{MODEL_NAME_SAFE}.pkl"
    # with stage_timer(f"Saving Dictionary to {DICTIONARY_PICKLE_PATH.name}"):
    #     with open(DICTIONARY_PICKLE_PATH, "wb") as f:
    #         pickle.dump(dictionary, f)

print("Dictionary size:", len(dictionary))


[2025-12-05 16:28:05,926] [INFO] ▶ Dictionary build: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/data/interim/octis/corpus.tsv | start
[2025-12-05 16:28:05,927] [INFO] ▶ Dictionary corpus streaming: corpus.tsv | start
[2025-12-05 16:28:07,857] [INFO] Dictionary stream | processed 50000 rows
[2025-12-05 16:28:07,858] [INFO] adding document #0 to Dictionary<0 unique tokens: []>
[2025-12-05 16:28:08,342] [INFO] adding document #10000 to Dictionary<17137 unique tokens: ['e', 'h', 'prologue', 'tired.', 'was']...>
[2025-12-05 16:28:08,902] [INFO] adding document #20000 to Dictionary<25789 unique tokens: ['e', 'h', 'prologue', 'tired.', 'was']...>
[2025-12-05 16:28:09,386] [INFO] adding document #30000 to Dictionary<33185 unique tokens: ['e', 'h', 'prologue', 'tired.', 'was']...>
[2025-12-05 16:28:09,822] [INFO] adding document #40000 to Dictionary<38507 unique tokens: ['e', 'h', 'prologue', 'tired.', 'was']...>
[2025-12-05 16:28:10,231] [INFO]

Dictionary size: 185903


## Cell 6: Build topic quality table + noise candidate labels


In [6]:
# Cell 6: Build topic quality table and flag noisy candidates

with stage_timer("Running topic EDA + noise candidate labeling"):
    quality_df = build_topic_quality_table(
        topic_model,
        docs_tokens=docs_tokens,
        dictionary=dictionary,
        min_size=MIN_TOPIC_SIZE,
        min_pos_words=MIN_POS_WORDS,
        min_pos_coherence=MIN_POS_COHERENCE,
        top_k=TOP_K,
    )

print("Total topics (excl. -1):", len(quality_df))
print("Candidate noisy topics:", int(quality_df["noise_candidate"].sum()))

# Quick peek: worst topics by POS coherence
print("\n=== 20 topics with lowest POS coherence ===")
display(
    quality_df.sort_values("coherence_c_v_pos").head(20)[
        ["Topic", "Count", "coherence_c_v_pos", "n_pos_words", "noise_candidate", "noise_reason", "inspection_label"]
    ]
)

# Quick peek: candidate noisy topics
print("\n=== 20 candidate noisy topics (for manual inspection) ===")
display(
    quality_df[quality_df["noise_candidate"]].head(20)[
        ["Topic", "Count", "coherence_c_v_pos", "n_pos_words", "noise_reason", "inspection_label"]
    ]
)


[2025-12-05 16:28:51,284] [INFO] ▶ Running topic EDA + noise candidate labeling | start
[2025-12-05 16:28:51,286] [INFO] ▶ Building topic quality table | start
[2025-12-05 16:28:51,310] [INFO] ▶ Extracting all topic representations | start
[2025-12-05 16:28:51,310] [INFO] Extracting 'Main' representation from topic_representations_
[2025-12-05 16:28:51,317] [INFO] Extracted 363 topics for 'Main' representation
[2025-12-05 16:28:51,318] [INFO] Extracting additional aspects from topic_aspects_
[2025-12-05 16:28:51,323] [INFO] Extracted 360 topics for 'KeyBERT' representation
[2025-12-05 16:28:51,328] [INFO] Extracted 363 topics for 'MMR' representation
[2025-12-05 16:28:51,334] [INFO] Extracted 361 topics for 'POS' representation
[2025-12-05 16:28:51,334] [INFO] ■ Extracting all topic representations | completed in 0.02 s
[2025-12-05 16:28:51,343] [INFO] ▶ Extracting all topic representations | start
[2025-12-05 16:28:51,344] [INFO] Extracting 'Main' representation from topic_representat

Total topics (excl. -1): 368
Candidate noisy topics: 13

=== 20 topics with lowest POS coherence ===


|  | Topic | Count | coherence_c_v_pos | n_pos_words | noise_candidate | noise_reason | inspection_label |
|---|---|---|---|---|---|---|---|
| 13 | 298 | 180 | 0.112269 | 10.0 | False |  | `298_aren_jail_nightlife_means` |
| 14 | 340 | 155 | 0.116221 | 10.0 | False |  | `340_reida_ha_mouths_samuela` |
| 15 | 304 | 178 | 0.118944 | 10.0 | False |  | `304_jar_barista_trance_suggests` |
| 16 | 324 | 163 | 0.133797 | 10.0 | False |  | `324_sensory_overreacting_ita_discussion` |
| 17 | 180 | 295 | 0.138756 | 10.0 | False |  | `180_ode_deserve_believed_ended` |
| 18 | 222 | 246 | 0.141362 | 4.0 | False |  | `222_yeah_benjamina_aftera_trial` |
| 19 | 346 | 152 | 0.142617 | 10.0 | False |  | `346_clairea_admirable_passionate_unenthusiasti...` |
| 20 | 321 | 165 | 0.14815 | 10.0 | False |  | `321_cilliana_antidote_maine_evon` |
| 21 | 190 | 282 | 0.148865 | 10.0 | False |  | `190_et_religion_priest_deity` |
| 22 | 81 | 535 | 0.149391 | 10.0 | False |  | `81_parents_children_kids_families` |



=== 20 candidate noisy topics (for manual inspection) ===


|  | Topic | Count | coherence_c_v_pos | n_pos_words | noise_reason | inspection_label |
|---|---|---|---|---|---|---|
| 0 | 182 | 291 | 0.345404 | 2.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 182_tough_umma_jer...` |
| 1 | 347 | 150 | 1.0 | 1.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 347_yessss___` |
| 2 | 262 | 204 | 1.0 | 1.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 262_imply_clever_h...` |
| 3 | 241 | 220 | 1.0 | 1.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 241_okaya_wella_ar...` |
| 4 | 183 | 290 | 1.0 | 1.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 183_division_stick...` |
| 5 | 141 | 360 | 1.0 | 1.0 | few_pos<3 | `[NOISE_CANDIDATE:few_pos<3] 141_rush___` |
| 6 | 359 | 146 |  |  | few_pos<3;low_coh<0.00 | `[NOISE_CANDIDATE:few_pos<3;low_coh<0.00] 359_p...` |
| 7 | 283 | 188 |  |  | few_pos<3;low_coh<0.00 | `[NOISE_CANDIDATE:few_pos<3;low_coh<0.00] 283_i...` |
| 8 | 224 | 245 |  |  | few_pos<3;low_coh<0.00 | `[NOISE_CANDIDATE:few_pos<3;low_coh<0.00] 224____` |
| 9 | 186 | 286 |  |  | few_pos<3;low_coh<0.00 | `[NOISE_CANDIDATE:few_pos<3;low_coh<0.00] 186____` |
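
The `noise_reason` strings above suggest that each threshold from Cell 2 contributes one reason, and that a missing POS word count or coherence value is treated as failing its threshold. The sketch below reproduces that pattern under those assumptions. `noise_reasons` is a hypothetical function, not the code in `build_topic_quality_table`, and the `small<...` reason name for the size rule is invented for illustration because no size-based reason appears in the output.

```python
import math
from typing import Optional


def _is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def noise_reasons(
    count: int,
    n_pos_words: Optional[float],
    coherence: Optional[float],
    min_size: int = 30,
    min_pos_words: int = 3,
    min_pos_coherence: float = 0.0,
) -> str:
    """Return a ';'-joined reason string; empty means 'not a noise candidate'."""
    reasons = []
    if count < min_size:
        # Hypothetical reason name; not observed in the notebook output.
        reasons.append(f"small<{min_size}")
    if _is_missing(n_pos_words) or n_pos_words < min_pos_words:
        reasons.append(f"few_pos<{min_pos_words}")
    if _is_missing(coherence) or coherence < min_pos_coherence:
        reasons.append(f"low_coh<{min_pos_coherence:.2f}")
    return ";".join(reasons)


# Rows taken from the tables above: topics 182, 359, and 298.
print(noise_reasons(291, 2.0, 0.345404))
print(noise_reasons(146, float("nan"), float("nan")))
print(repr(noise_reasons(180, 10.0, 0.112269)))
```

With `MIN_POS_COHERENCE = 0.0`, the coherence rule only fires for missing values in this sketch, which is consistent with the lowest-coherence topics (around 0.11) not being flagged.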


## Cell 7: Save EDA results for manual inspection

## Cell 8: Label noisy topics (candidates + POS < 10)

This cell labels a topic as noisy if either condition holds:
1. It is an existing noise candidate (flagged in Cell 6)
2. It has fewer than 10 POS words

The labels are applied to the model for visualization and can be reused in downstream analysis.


In [7]:
# Cell 8: Label noisy topics (candidates + POS < 10)

# Identify all topics to label as noisy:
# 1. Existing noise candidates
# 2. Topics with POS words < 10

noise_candidates = quality_df[quality_df["noise_candidate"]].copy()
topics_few_pos = quality_df[quality_df["n_pos_words"].fillna(0) < 10].copy()

# Combine both sets (remove duplicates)
all_noisy_topics = pd.concat([noise_candidates, topics_few_pos]).drop_duplicates(subset=["Topic"])

print("=" * 80)
print("Noisy Topic Labeling Summary")
print("=" * 80)
print(f"Existing noise candidates: {len(noise_candidates)}")
print(f"Topics with POS words < 10: {len(topics_few_pos)}")
print(f"Total unique noisy topics to label: {len(all_noisy_topics)}")
print("=" * 80)

# Create inspection labels for all noisy topics
def create_noise_label(row) -> str:
    """Create inspection label for a noisy topic."""
    base_name = str(row.get("Name", "") or "").strip()
    reasons = []
    
    # Check why it's noisy
    if row.get("noise_candidate", False):
        reasons.append("noise_candidate")
    if pd.notna(row.get("n_pos_words")) and row["n_pos_words"] < 10:
        reasons.append(f"pos<10({int(row['n_pos_words'])})")
    
    reason_str = ";".join(reasons)
    if not reason_str:
        return base_name
    
    return f"[NOISE:{reason_str}] {base_name}"

all_noisy_topics["inspection_label"] = all_noisy_topics.apply(create_noise_label, axis=1)

# Display summary
print("\n=== Topics to be labeled as noisy ===")
display(
    all_noisy_topics[["Topic", "Count", "n_pos_words", "coherence_c_v_pos", "inspection_label"]]
    .sort_values("Topic")
)

# Apply labels to model
noise_labels = {
    int(row.Topic): str(row.inspection_label)
    for row in all_noisy_topics.itertuples(index=False)
}

print(f"\nPrepared {len(noise_labels)} labels for noisy topics")

# Merge with existing labels
existing_labels = getattr(topic_model, "custom_labels_", None)
labels_dict = {}

if isinstance(existing_labels, dict):
    labels_dict.update(existing_labels)
    print(f"Found {len(existing_labels)} existing labels")

labels_dict.update(noise_labels)

with stage_timer("Setting custom labels for noisy topics"):
    topic_model.set_topic_labels(labels_dict)
    LOGGER.info("Updated labels for %d topics (noisy + existing)", len(labels_dict))

print(f"\n✓ Applied labels to {len(noise_labels)} noisy topics")
print(f"✓ Total labels in model: {len(labels_dict)}")

# Save updated model in BOTH formats: native BERTopic directory AND wrapper pickle
save_model = True  # Set to False if you don't want to save yet
if save_model:
    # 1. Save as native BERTopic model (directory format)
    native_model_dir = BASE_DIR / EMBEDDING_MODEL / f"model_{PARETO_RANK}_with_noise_labels"
    # Remove directory if it exists (BERTopic.save() will recreate it)
    if native_model_dir.exists() and native_model_dir.is_dir():
        import shutil
        shutil.rmtree(native_model_dir)
    
    with stage_timer(f"Saving native BERTopic model with noise labels to {native_model_dir}"):
        topic_model.save(str(native_model_dir))
        LOGGER.info("Saved native BERTopic model with noise labels to %s", native_model_dir)
        print(f"✓ Saved native BERTopic model to: {native_model_dir}")
    
    # 2. Save as wrapper pickle (file format) - only if wrapper was loaded
    if wrapper is not None:
        wrapper_pickle_path = BASE_DIR / EMBEDDING_MODEL / f"model_{PARETO_RANK}_with_noise_labels.pkl"
        backup_existing_file(wrapper_pickle_path)
        
        with stage_timer(f"Saving wrapper with noise labels to {wrapper_pickle_path.name}"):
            with open(wrapper_pickle_path, "wb") as f:
                pickle.dump(wrapper, f)
            LOGGER.info("Saved wrapper with noise labels to %s", wrapper_pickle_path)
            print(f"✓ Saved wrapper pickle to: {wrapper_pickle_path}")
    else:
        print("⚠️  Wrapper not available (loaded native model), skipping wrapper save")


Noisy Topic Labeling Summary
Existing noise candidates: 13
Topics with POS words < 10: 20
Total unique noisy topics to label: 20

=== Topics to be labeled as noisy ===


|  | Topic | Count | n_pos_words | coherence_c_v_pos | inspection_label |
|---|---|---|---|---|---|
| 12 | 17 | 2064 |  |  | `[NOISE:noise_candidate] 17____` |
| 11 | 18 | 2062 |  |  | `[NOISE:noise_candidate] 18____` |
| 43 | 106 | 433 | 6.0 | 0.201386 | `[NOISE:pos<10(6)] 106_wasa_thata_likea_sa` |
| 359 | 120 | 399 | 4.0 | 0.538068 | `[NOISE:pos<10(4)] 120_aye_string_parts_fianca` |
| 10 | 132 | 376 |  |  | `[NOISE:noise_candidate] 132____` |
| 5 | 141 | 360 | 1.0 | 1.0 | `[NOISE:noise_candidate;pos<10(1)] 141_rush___` |
| 0 | 182 | 291 | 2.0 | 0.345404 | `[NOISE:noise_candidate;pos<10(2)] 182_tough_um...` |
| 4 | 183 | 290 | 1.0 | 1.0 | `[NOISE:noise_candidate;pos<10(1)] 183_division...` |
| 9 | 186 | 286 |  |  | `[NOISE:noise_candidate] 186____` |
| 76 | 205 | 260 | 8.0 | 0.232126 | `[NOISE:pos<10(8)] 205_fuckup_rule_horny_rules` |


[2025-12-05 16:29:46,565] [INFO] ▶ Setting custom labels for noisy topics | start
[2025-12-05 16:29:46,611] [INFO] Updated labels for 20 topics (noisy + existing)
[2025-12-05 16:29:46,611] [INFO] ■ Setting custom labels for noisy topics | completed in 0.04 s
[2025-12-05 16:29:46,613] [INFO] ▶ Saving native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels | start



Prepared 20 labels for noisy topics

✓ Applied labels to 20 noisy topics
✓ Total labels in model: 20


[2025-12-05 16:29:58,475] [INFO] Saved native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels
[2025-12-05 16:29:58,476] [INFO] ■ Saving native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels | completed in 11.86 s


✓ Saved native BERTopic model to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels


[2025-12-05 16:30:03,496] [INFO] Backed up existing file: model_1_with_noise_labels.pkl -> model_1_with_noise_labels_backup_20251205_162958.pkl
[2025-12-05 16:30:03,497] [INFO] ▶ Saving wrapper with noise labels to model_1_with_noise_labels.pkl | start
[2025-12-05 16:30:17,894] [INFO] Saved wrapper with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels.pkl
[2025-12-05 16:30:17,895] [INFO] ■ Saving wrapper with noise labels to model_1_with_noise_labels.pkl | completed in 14.40 s


✓ Saved wrapper pickle to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels.pkl


In [8]:
# Cell 7: Save topic quality table + noisy-topic subset

quality_path_full = OUTPUT_DIR / f"topic_quality_{MODEL_NAME_SAFE}.csv"
quality_path_noise = OUTPUT_DIR / f"topic_noise_candidates_{MODEL_NAME_SAFE}.csv"

for path, df_to_save in [
    (quality_path_full, quality_df),
    (quality_path_noise, quality_df[quality_df["noise_candidate"]]),
]:
    backup_existing_file(path)
    with stage_timer(f"Saving {path.name}"):
        df_to_save.to_csv(path, index=False)
        LOGGER.info("Saved %d rows to %s", len(df_to_save), path)

print("Saved full quality table to:", quality_path_full)
print("Saved noise candidates to:", quality_path_noise)


[2025-12-05 16:30:17,916] [INFO] Backed up existing file: topic_quality_paraphrase-MiniLM-L6-v2.csv -> topic_quality_paraphrase-MiniLM-L6-v2_backup_20251205_163017.csv
[2025-12-05 16:30:17,917] [INFO] ▶ Saving topic_quality_paraphrase-MiniLM-L6-v2.csv | start
[2025-12-05 16:30:17,941] [INFO] Saved 368 rows to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/results/stage06_eda/topic_quality_paraphrase-MiniLM-L6-v2.csv
[2025-12-05 16:30:17,942] [INFO] ■ Saving topic_quality_paraphrase-MiniLM-L6-v2.csv | completed in 0.02 s
[2025-12-05 16:30:17,943] [INFO] Backed up existing file: topic_noise_candidates_paraphrase-MiniLM-L6-v2.csv -> topic_noise_candidates_paraphrase-MiniLM-L6-v2_backup_20251205_163017.csv
[2025-12-05 16:30:17,944] [INFO] ▶ Saving topic_noise_candidates_paraphrase-MiniLM-L6-v2.csv | start
[2025-12-05 16:30:17,947] [INFO] Saved 13 rows to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predicto

Saved full quality table to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/results/stage06_eda/topic_quality_paraphrase-MiniLM-L6-v2.csv
Saved noise candidates to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/results/stage06_eda/topic_noise_candidates_paraphrase-MiniLM-L6-v2.csv
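
The log shows that existing output files are backed up under a name of the form `<stem>_backup_<YYYYMMDD_HHMMSS><suffix>` before being overwritten. The sketch below shows one way such a helper could work. `backup_with_timestamp` is a hypothetical stand-in for the project's `backup_existing_file`, and it assumes the backup is a copy rather than a rename, which the log does not confirm.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_with_timestamp(path: Path) -> Optional[Path]:
    """Copy an existing file to '<stem>_backup_<timestamp><suffix>'.

    Returns the backup path, or None when there is nothing to back up.
    """
    path = Path(path)
    if not path.is_file():
        return None
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup = path.with_name(f"{path.stem}_backup_{stamp}{path.suffix}")
    shutil.copy2(path, backup)  # assumption: copy, keeping the original in place
    return backup
```

Calling it right before `to_csv` keeps the previous run's table available for comparison.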


## Cell 8 (Optional): Add labels to noisy topics in the model

This cell is optional. It marks noisy topics inside the BERTopic model (for visualizations, etc.) but does **not** delete topics. You can later overwrite these when you add LLM-generated labels.
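
Overwriting noise labels with LLM-generated labels later comes down to a precedence rule between two label dictionaries. The sketch below shows that rule on plain dicts; `merge_topic_labels` and the example labels are hypothetical and not part of the project code.

```python
from typing import Dict, Optional


def merge_topic_labels(
    noise_labels: Dict[int, str],
    llm_labels: Optional[Dict[int, str]] = None,
) -> Dict[int, str]:
    """Combine label dicts; LLM labels win when both define a topic."""
    merged = dict(noise_labels)  # copy so the input is not mutated
    if llm_labels:
        merged.update(llm_labels)
    return merged


# Toy example: topic 141 keeps its noise label, topic 182 is overwritten.
noise = {141: "[NOISE_CANDIDATE:few_pos<3] 141_rush___", 182: "[NOISE_CANDIDATE:few_pos<3] 182_tough"}
llm = {182: "Example LLM label"}
print(merge_topic_labels(noise, llm))
```

The merged dict can then be passed to `topic_model.set_topic_labels(...)` in the same way as in the cell below.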


In [9]:
# Cell 8 (optional): Push inspection labels into BERTopic for noisy topics

# Build a label dict only for candidate noisy topics
noise_candidates_df = quality_df[quality_df["noise_candidate"]]

noise_labels = apply_noise_labels_to_model(
    topic_model,
    quality_df,
    only_noise_candidates=True,
)

print(f"Prepared {len(noise_labels)} labels for candidate noisy topics")

# Option 1: Only label noisy topics (other topics keep their existing labels / names).
# Note: BERTopic normally stores custom_labels_ as a list, so the dict check below is usually skipped.
# When set_topic_labels receives a dict, topics missing from it keep their current label or default name.

existing_labels = getattr(topic_model, "custom_labels_", None)
labels_dict = {}

if isinstance(existing_labels, dict):
    labels_dict.update(existing_labels)

labels_dict.update(noise_labels)

with stage_timer("Setting custom labels for noisy topics"):
    topic_model.set_topic_labels(labels_dict)  # uses BERTopic's custom_labels_ field
    LOGGER.info("Updated labels for %d topics (noise candidates + existing)", len(labels_dict))

# Save updated model in BOTH formats: native BERTopic directory AND wrapper pickle
# 1. Save as native BERTopic model (directory format)
native_model_dir = BASE_DIR / EMBEDDING_MODEL / f"model_{PARETO_RANK}_with_noise_labels"
# Remove directory if it exists (BERTopic.save() will recreate it)
if native_model_dir.exists() and native_model_dir.is_dir():
    import shutil
    shutil.rmtree(native_model_dir)

with stage_timer(f"Saving native BERTopic model with noise labels to {native_model_dir}"):
    topic_model.save(str(native_model_dir))
    LOGGER.info("Saved native BERTopic model with noise labels to %s", native_model_dir)
    print(f"✓ Saved native BERTopic model to: {native_model_dir}")

# 2. Save as wrapper pickle (file format) - only if wrapper was loaded
if wrapper is not None:
    wrapper_pickle_path = BASE_DIR / EMBEDDING_MODEL / f"model_{PARETO_RANK}_with_noise_labels.pkl"
    backup_existing_file(wrapper_pickle_path)
    
    with stage_timer(f"Saving wrapper with noise labels to {wrapper_pickle_path.name}"):
        with open(wrapper_pickle_path, "wb") as f:
            pickle.dump(wrapper, f)
        LOGGER.info("Saved wrapper with noise labels to %s", wrapper_pickle_path)
        print(f"✓ Saved wrapper pickle to: {wrapper_pickle_path}")
else:
    print("⚠️  Wrapper not available (loaded native model), skipping wrapper save")


[2025-12-05 16:30:17,974] [INFO] Prepared 13 labels for noise candidates
[2025-12-05 16:30:17,976] [INFO] ▶ Setting custom labels for noisy topics | start
[2025-12-05 16:30:18,010] [INFO] Updated labels for 13 topics (noise candidates + existing)
[2025-12-05 16:30:18,011] [INFO] ■ Setting custom labels for noisy topics | completed in 0.03 s
[2025-12-05 16:30:18,012] [INFO] ▶ Saving native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels | start


Prepared 13 labels for candidate noisy topics


[2025-12-05 16:30:32,619] [INFO] Saved native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels
[2025-12-05 16:30:32,621] [INFO] ■ Saving native BERTopic model with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels | completed in 14.61 s


✓ Saved native BERTopic model to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels


[2025-12-05 16:30:38,695] [INFO] Backed up existing file: model_1_with_noise_labels.pkl -> model_1_with_noise_labels_backup_20251205_163032.pkl
[2025-12-05 16:30:38,696] [INFO] ▶ Saving wrapper with noise labels to model_1_with_noise_labels.pkl | start
[2025-12-05 16:31:05,555] [INFO] Saved wrapper with noise labels to /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels.pkl
[2025-12-05 16:31:05,557] [INFO] ■ Saving wrapper with noise labels to model_1_with_noise_labels.pkl | completed in 26.86 s


✓ Saved wrapper pickle to: /home/polina/Documents/goodreads_romance_research_cursor/billionaire_novels_rating_predictor/models/retrained/paraphrase-MiniLM-L6-v2/model_1_with_noise_labels.pkl
