# Visualization Toolkit Usage Guide

This notebook provides a suite of semantic analysis and visualization tools to explore word-level patterns and relationships in qualitative text data. It supports multiple clustering and embedding strategies for t-SNE, semantic networks, and word clouds.

### What You Can Do with This Toolkit

- **Generate Word Clouds**  
  Visualize the most frequent and salient terms across your dataset or within filtered subsets based on keywords or project name.

- **Plot t-SNE Semantic Maps**  
  Reduce high-dimensional similarity or co-occurrence matrices into 2D using t-SNE to highlight semantic proximity between words. Seed words are emphasized to anchor interpretation.

- **Create Word-Based Heatmaps and Semantic Networks**  
  Explore how terms relate to each other in both visual space and co-occurrence structure:
  
  - **Basic Heatmap**: Highlights semantically clustered keywords based on selected embeddings or similarity matrices.
  - **Heatmap + Network (Black & White)**: Adds a basic network graph on top of the heatmap using default node/edge colors (no bundling, no category styling).
  - **Heatmap + Network (Colored Nodes & Edges)**: Fully stylized version including colored clusters, semantic links, and optional edge styling.

- **Visualize Code Co-Occurrence Heatmaps**  
  Plot the frequency with which qualitative codes appear together in the same entries. 

These tools help you visually investigate language patterns, conceptual clustering, and topic proximity—whether you’re doing grounded theory, thematic analysis, or exploratory semantic mapping.

### How to Use
1. **Prepare Your Dataset**  
   Make sure your dataset is a `.csv` or DataFrame with at least:
   - A `text` column (raw text)
   - Optionally, a `project` column (for subsetting)
   - Optionally, a `codes` column (for subsetting and analysis)
   - Please see READ.md for the complete schema and more possibilities.

2. **Set Key Parameters**  
   Most functions accept:
   - `stopwords_path`: path to extra stopwords (optional)
   - `clustering_method`: 1 = RoBERTa, 2 = Jaccard, 3 = PMI, 4 = TF-IDF
   - `distance_metric`: "cosine" or "default" (used for similarity matrix choice)

3. **Run Visualizations**  
   You can explore and generate a variety of visual outputs using the execution blocks provided in the **Analytics Tools** and **Advanced Analytics Tools** sections.

These include:
- **Word Clouds** for highlighting high-frequency terms in selected texts or projects.
- **t-SNE Semantic Maps** to project word relationships into 2D space for visualizing proximity and clusters.
- **Word-Based Heatmaps** to show how frequently words co-occur.
- **Semantic Network Graphs** (with optional heatmaps) to show relationships in text, including:
  - basic node/seed networks,
  - networks with customized node colors,
  - networks with or without edge bundling.
- **Code-based Heatmap** to visualize co-occurrence patterns among qualitative codes

📍 Simply scroll down to the relevant execution cells to run and customize these visualizations based on your dataset.
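
As a rough illustration of the dataset layout described in step 1, the sketch below writes and re-reads a tiny CSV with `text`, `project`, and `codes` columns. The sample rows, the project names, and the choice to store `codes` as a stringified Python list are assumptions made for illustration; the README holds the authoritative schema.

```python
import ast
import csv
import io

# Hypothetical sample rows; real data would come from your own export.
rows = [
    {"text": "My teacher helped me apply to college.", "project": "pilot",
     "codes": ["education", "support"]},
    {"text": "Caring for my mother changed my routine.", "project": "main",
     "codes": ["caregiving"]},
]

# Write the rows to an in-memory CSV (a stand-in for a file such as data/data.csv).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "project", "codes"])
writer.writeheader()
for row in rows:
    # Lists are stored as their string form, e.g. "['education', 'support']".
    writer.writerow({**row, "codes": str(row["codes"])})

# Read the file back and turn the stringified lists into real lists again.
buffer.seek(0)
loaded = []
for record in csv.DictReader(buffer):
    record["codes"] = ast.literal_eval(record["codes"])
    loaded.append(record)

# Subset by project, similar in spirit to the project filter in this toolkit.
pilot_rows = [r for r in loaded if r["project"] == "pilot"]
print(pilot_rows[0]["codes"])  # ['education', 'support']
```

Starting from a two-row file like this is an easy way to confirm that column names and list formatting are what the visualizations expect before loading a full dataset.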

### File Structure Notes
- Define constants like `DATA_DIR`, `OUTPUT_DIR`, and stopword files before running.
- Ensure required files and embeddings are preloaded or generated using prior pipeline steps.

> Tip: Start by testing on one project or topic before scaling to all data.

# TESTING: PLEASE DO NOT SHARE OR CITE WITHOUT WRITTEN PERMISSION

## Setup 

### Packages Loaded

In [None]:
# Python built-ins
# Using Python 3.11.13
import os
import re
import string
import time
import urllib.request
from functools import lru_cache
from collections import Counter
import warnings
import ast
import sys
import platform
import importlib

# Data loading
import pandas as pd
import numpy as np
from dotenv import load_dotenv

# Natural Language Processing (NLP)
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Ensure required NLTK resources are available
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")

try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt_tab")

try:
    nltk.data.find("corpora/stopwords")
except LookupError:
    nltk.download("stopwords")

try:
    nltk.data.find("corpora/wordnet")
except LookupError:
    nltk.download("wordnet")

try:
    nltk.data.find("taggers/averaged_perceptron_tagger")
except LookupError:
    nltk.download("averaged_perceptron_tagger")

# Sentence / transformer embeddings
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModel

# Machine Learning / Math
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, leaves_list
import scipy.cluster.hierarchy as hierarchy
import torch

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import matplotlib.cm as cm
from wordcloud import WordCloud
from PIL import Image, ImageDraw
import pprint

# Validation & helpers
from pydantic import BaseModel, ConfigDict, FilePath, ValidationError, field_validator
from typing import List, Optional, Union, Any
import nbimporter  # enables importing from other notebooks

# Project-local modules
# ⚠️ Environment warning:
# `vis_tool_core.py` is imported from the `function` package, so the `function/`
# folder (containing `vis_tool_core.py`) should sit in the same directory
# as this notebook.
from function import vis_tool_core

# Force-reload the module so that any code changes are picked up
importlib.reload(vis_tool_core)
from function.vis_tool_core import *

#### Version Check

This section checks the installed versions of the loaded packages against the supported ranges.
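
`check_versions` is not defined in this notebook and presumably comes from `vis_tool_core`. As a minimal stand-in for readers who want to see the idea, the sketch below compares installed versions against range strings like those in `SUPPORTED`. The names `check_versions_sketch`, `satisfies`, and the `installed` argument are hypothetical, and the comparison is a simplified tuple check rather than `packaging`'s `SpecifierSet`, so it ignores pre-release tags. Entries such as `python` or `sklearn`, whose keys differ from their distribution names, would also need special handling that is omitted here.

```python
import operator
import re
from importlib import metadata

# Comparison operators supported by this simplified sketch.
_OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq,
        ">": operator.gt, "<": operator.lt}


def _as_tuple(version, width=4):
    """Turn '2.1.0+cu121' into (2, 1, 0, 0): suffixes dropped, zeros padded."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    parts = [int(p) for p in match.group().split(".")] if match else []
    return tuple(parts[:width] + [0] * (width - len(parts)))


def satisfies(version, spec):
    """Check a version against a comma-separated spec such as '>=1.24,<2.2'."""
    for clause in spec.split(","):
        op, bound = re.match(r"\s*(>=|<=|==|>|<)\s*(\S+)", clause).groups()
        if not _OPS[op](_as_tuple(version), _as_tuple(bound)):
            return False
    return True


def check_versions_sketch(supported, installed=None):
    """Return True when every listed package is present and within its range.

    `installed` is an optional {name: version} mapping (hypothetical, for testing);
    when omitted, versions are looked up with importlib.metadata.
    """
    all_ok = True
    for name, spec in supported.items():
        if installed is not None:
            version = installed.get(name)
        else:
            try:
                version = metadata.version(name)
            except metadata.PackageNotFoundError:
                version = None
        if version is None:
            print(f"{name}: not installed")
            all_ok = False
        elif not satisfies(version, spec):
            print(f"{name}: {version} is outside {spec}")
            all_ok = False
    return all_ok


# An explicit mapping keeps the result independent of the local environment.
print(check_versions_sketch({"numpy": ">=1.24,<2.2"}, installed={"numpy": "1.26.4"}))  # True
```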

In [None]:
# version_check.py 
import platform, importlib
from datetime import datetime
from packaging.version import Version
from packaging.specifiers import SpecifierSet

SUPPORTED = {
    "python": ">=3.10,<3.13.2",
    "numpy": ">=1.24,<2.2",
    "pandas": ">=1.5,<2.3",
    "nltk": ">=3.8,<4",
    "gensim": ">=4.3,<5",
    "sklearn": ">=1.2,<1.5",
    "sentence_transformers": ">=2.2,<3.6",
    "transformers": ">=4.36,<5.1",
    "torch": ">=2.1,<2.8",
    "matplotlib": ">=3.7,<4",
    "seaborn": ">=0.12,<1",
    "networkx": ">=3,<4",
    "wordcloud": ">=1.9,<2",
    "dash": ">=2.10,<3",
    "plotly": ">=5.0,<6",  
    "tqdm": ">=4.65,<5",
    "joblib": ">=1.2,<2",     
    "dill": ">=0.3.6,<0.4",
    "python-dotenv": ">=0.15,<2",
    "pydantic": ">=2,<3",  # field_validator and ConfigDict require Pydantic v2
}
print(f"\n Environment Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
if not check_versions(SUPPORTED):
    print("Please align your environment to the required versions above.")

### Environment Configuration

In this section, we define the core directory structure used throughout the replication project. These paths help organize:

- **raw data**,  
- **trained models**,  
- **clustering results**, and  
- **final outputs**.

This setup also makes it easier for users to customize where intermediate results and final outputs will be saved. For example, by changing these directory names, users can **create their own versions of model runs or clustering outputs without overwriting previous results.**

All directories will be automatically created if they don't already exist.

In [None]:
os.makedirs("data", exist_ok=True)
os.makedirs("input", exist_ok=True)

# Define base directory
BASE_DIR = os.getcwd()

# Define project directories
DATA_DIR = os.path.join(BASE_DIR, "data")
INPUT_DIR = os.path.join(BASE_DIR, "input")
BACKUP_DIR = os.path.join(BASE_DIR, "backup")
MODEL_DIR = os.path.join(DATA_DIR, "models", "auto_model")
CLUSTERING_DIR = os.path.join(DATA_DIR, "models", "clusterings")
OUTPUT_DIR = os.path.join(DATA_DIR, "outputs")
LAST_CSV_PATH = os.path.join(OUTPUT_DIR, "last_csv_path.txt")

# Paths
CSV_PATH = os.path.join(DATA_DIR, "data.csv")
STOP_LIST_FILE = os.path.join(INPUT_DIR, "additional_stops.txt")

# Cache 
LAST_CONFIG_PATH = "last_run_config.json"

# Create directories if they don't exist
for d in (BACKUP_DIR, MODEL_DIR, CLUSTERING_DIR, OUTPUT_DIR):
    os.makedirs(d, exist_ok=True)

# Initialize NLTK components
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

# Initialize lemmatizer (only once)
lemmatizer = WordNetLemmatizer()

# Torch Device
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
print(f"Using device: {device}")

# HuggingFace token for private models.
# The token is free to obtain, though it might be packageable with one of the BERT models.
# Currently using RoBERTa, but the distilled version is doable.
load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN") or ""

# Enable GPU acceleration if available (the device itself was selected above).
# Note: not optimized for MPS on Mac, but the device should be accessible.
USE_GPU_ACCELERATION = True

# Define and load the preferred model
# Use 'roberta-base' for tests (smaller, faster), and switch back to 'all-roberta-large-v1' for full runs
MODEL_NAME = "roberta-base"  

print(f"Loading '{MODEL_NAME}'...")

# Load model and tokenizer
TOKENIZER = AutoTokenizer.from_pretrained(MODEL_NAME)
# Load HuggingFace model
# ⚠️ Note on add_pooling_layer:
# - Plain RoBERTa checkpoints (e.g., "roberta-base", "roberta-large") do NOT include a pooling layer.
#   If you load them with the default settings, HuggingFace will create a random pooler and issue a warning.
#   To avoid this, we explicitly set add_pooling_layer=False for these cases.
# - For Sentence-Transformers models (e.g., "all-roberta-large-v1"), KEEP the default (pooling layer included).
#   These models rely on pooling for producing sentence embeddings.
#
# So: 
#   MODEL = AutoModel.from_pretrained(MODEL_NAME, add_pooling_layer=False)  # for roberta-base / roberta-large
#   MODEL = AutoModel.from_pretrained(MODEL_NAME)                          # for embedding models like all-roberta-large-v1

MODEL = AutoModel.from_pretrained(MODEL_NAME, add_pooling_layer=False)
MAX_TOKENS = TOKENIZER.model_max_length
warnings.filterwarnings(
    "ignore",
    message="Some weights of RobertaModel were not initialized"
)

# Move model to the appropriate device
MODEL.to(device)

# Optional: Enable CUDA optimizations for better performance on NVIDIA GPUs
if device.type == 'cuda':
    torch.backends.cudnn.benchmark = True
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"CUDA memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"CUDA memory reserved: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")

print(f"Loaded '{MODEL_NAME}' on {device}")
print(f"Maximum length of tokens is {MAX_TOKENS}")

### Stopword Expansion and Semantic Word Family Definitions

This section defines a comprehensive list of stopwords, extending NLTK’s default stopword set with:
- **punctuation**, 
- **contractions**, 
- **common filler words**, and 
- **project-specific conversational terms** that are semantically uninformative in analysis.

We also build a custom `WORD_FAMILIES` dictionary, which groups related words into unified concepts (e.g., "death", "caregiver", "memory"). This allows the model to:
- **compress synonyms and variations** into semantically meaningful units,
- **reduce noise** in the embedding space,
- and **support cultural/qualitative interpretation** of the results.

This section also includes validation checks to:
- Ensure **no accidental overlaps** between stopwords and key analytical terms,
- Detect **redundant words across families**, 
- And print summaries for user verification.

These definitions are critical for interpretability in downstream visualization and clustering.
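
To make the compression step concrete, here is a minimal sketch of how a word-family map can be inverted and applied to tokens. The miniature `word_families` and `stop_words` values are hypothetical, and `normalize` is an illustrative helper rather than a function from `vis_tool_core`. The lookup works on single tokens, so multi-word variants such as "graduate school" would need separate handling that is not shown here.

```python
# Hypothetical miniature family map, mirroring the structure of WORD_FAMILIES.
word_families = {
    "education": ["college", "schooling", "university"],
    "people": ["person", "student", "teacher"],
}
stop_words = {"the", "a", "to", "my"}

# Invert the map so each variant points at its family label.
word_to_base = {
    variant.lower(): base
    for base, variants in word_families.items()
    for variant in variants
}


def normalize(tokens):
    """Drop stopwords, then replace each remaining token with its family label if it has one."""
    lowered = (tok.lower() for tok in tokens)
    return [word_to_base.get(t, t) for t in lowered if t not in stop_words]


tokens = ["My", "teacher", "went", "to", "the", "university"]
print(normalize(tokens))  # ['people', 'went', 'education']

# The same inverted map makes the overlap check cheap: no family word should be a stopword.
print(set(word_to_base) & stop_words)  # set()
```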

In [None]:
# Base stop words from NLTK and punctuation
default_stop_words = set(stopwords.words('english') + list(string.punctuation))

# Common contractions and special characters
common_special_chars = {'...', "''", '""', "``", "--", "n't", "'s", '|'}
default_stop_words.update(common_special_chars)

# Additional stop words for conversation analysis
additional_stops = {
    # Common verbs that don't add semantic value
    'would', 'could', 'may', 'also', 'one', 'like', 'get', 'well', 
    'many', 'much', 'even', 'said', 'say', 'says', 'see', 'seen',
    'use', 'used', 'using', 'way', 'ways', 'make', 'makes', 'made',
    'take', 'takes', 'took', 'taken', 'go', 'goes', 'going', 'went',
    'come', 'comes', 'coming', 'came', 'try', 'tries', 'tried',
    
    # General placeholders
    'thing', 'things', 'something', 'anything', 'everything',
    'someone', 'anyone', 'everyone', 'somebody', 'anybody', 'everybody',
    
    # Filler words and conversational markers
    'um', 'uh', 'hmm', 'oh', 'huh', 'uhhuh', 'yeah', 'okay',
    'sorta', 'kinda', 'basically', 'literally', 'honestly', 'anyway',
    'whatever', 'actually', 'really', 'just', 'pretty', 'right',
    
    # Common pronouns and contractions
    'im', 'youre', 'shes', 'hes', 'theyre', 'ive', 'dont', 'cant',
    'doesnt', 'didnt', 'thats', 'theres', 'heres', 'couldnt', 'shouldnt', 
    'wouldnt', 'lets', 'youve', 'weve', 'theyve', 'whats', 'whos', 'hows', 
    'wheres', 'gotta', 'gonna', 'wanna', 'aint', 'alot', 'isnt', 'wont',
    
    # Enhanced common words to exclude from auto-selection
    'tell', 'told', 'right', 'lot', 'way', 'kind', 'bit', 'maybe', 
    'still', 'stuff', 'sure', 'getting', 'gets', 'goes', 'gone',
    'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
    'first', 'second', 'third', 'fourth', 'fifth', 'last', 'next',
    'should', 'might', 'must', 'may', 'can', 'cannot',
    'ah', 'wow', 'yes', 'no', 'nope', 'ok',
    'hey', 'hi', 'hello', 'bye', 'goodbye', 'etc', 'etc.',
    
    # Single letters and 2 letter pairs
    'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
    'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am',
    'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay', 'az',
    'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bk', 'bl', 'bm',
    'bn', 'bo', 'bp', 'bq', 'br', 'bs', 'bt', 'bu', 'bv', 'bw', 'bx', 'by', 'bz',
    'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'cg', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm',
    'cn', 'co', 'cp', 'cq', 'cr', 'cs', 'ct', 'cu', 'cv', 'cw', 'cx', 'cy', 'cz',
    'da', 'db', 'dc', 'dd', 'de', 'df', 'dg', 'dh', 'di', 'dj', 'dk', 'dl', 'dm',
    'dn', 'do', 'dp', 'dq', 'dr', 'ds', 'dt', 'du', 'dv', 'dw', 'dx', 'dy', 'dz',
    'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'eg', 'eh', 'ei', 'ej', 'ek', 'el', 'em',
    'en', 'eo', 'ep', 'eq', 'er', 'es', 'et', 'eu', 'ev', 'ew', 'ex', 'ey', 'ez',
    'fa', 'fb', 'fc', 'fd', 'fe', 'ff', 'fg', 'fh', 'fi', 'fj', 'fk', 'fl', 'fm',
    'fn', 'fo', 'fp', 'fq', 'fr', 'fs', 'ft', 'fu', 'fv', 'fw', 'fx', 'fy', 'fz',
    'ga', 'gb', 'gc', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gj', 'gk', 'gl', 'gm',
    'gn', 'go', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gv', 'gw', 'gx', 'gy', 'gz',
    'ha', 'hb', 'hc', 'hd', 'he', 'hf', 'hg', 'hh', 'hi', 'hj', 'hk', 'hl', 'hm',
    'hn', 'ho', 'hp', 'hq', 'hr', 'hs', 'ht', 'hu', 'hv', 'hw', 'hx', 'hy', 'hz',
    'ia', 'ib', 'ic', 'id', 'ie', 'if', 'ig', 'ih', 'ii', 'ij', 'ik', 'il', 'im',
    
    # Project-specific stopwords from file
    'a', 'um', 'an', 'the', 'have', 'dont', 'get', 'know', 'there', 'org', 'happen', 'find'
}

# Remove 'few', 'many', and 'more' from stop words as they're needed in word families
# if 'few' in additional_stops:
#     additional_stops.remove('few')
# if 'many' in additional_stops:
#     additional_stops.remove('many')
# if 'more' in additional_stops:
#     additional_stops.remove('more')
# if 'few' in default_stop_words:
#     default_stop_words.remove('few')
# if 'many' in default_stop_words:
#     default_stop_words.remove('many')
# if 'more' in default_stop_words:
#     default_stop_words.remove('more')

# Update the default stop words with our additional list
default_stop_words.update(additional_stops)

# ------------------------------------------------
# Normalized Word-Family Map
# ------------------------------------------------


# Word families compress related terms, moving from embedding space (semantic) to
# meaning space (cultural). The idea is to get at more meaningful concepts and
# minimize false negatives in qualitative analyses (see Li and Abramson 2021; Abramson 2011).
WORD_FAMILIES = {
    "education": ["college", "schooling", "graduate school", "university"],
    "people": ["person", "student", "teacher"],
    "dementia": ["dementia"]
}

# Remove empty word families
WORD_FAMILIES = {k: v for k, v in WORD_FAMILIES.items() if v}

# ------------------------------------------------
# Summary and checks
# ------------------------------------------------

# Print summary information
print(f"Total stop words: {len(default_stop_words)}")
print(f"Total word family compressions: {len(WORD_FAMILIES)}")

# Check for redundancies in stop words: entries in additional_stops that
# NLTK's English stop list already covers
nltk_stop_words = set(stopwords.words('english'))
duplicate_stops = sorted(additional_stops & nltk_stop_words)
if duplicate_stops:
    print(f"Warning: Found {len(duplicate_stops)} redundant stop words")
else:
    print("No redundant stop words found")

# Check for redundancies in word families
all_words = []
word_to_families = {}
for family, words in WORD_FAMILIES.items():
    for word in words:
        all_words.append(word)
        if word not in word_to_families:
            word_to_families[word] = []
        word_to_families[word].append(family)
    
duplicate_words = [word for word in all_words if all_words.count(word) > 1]
if duplicate_words:
    print(f"Warning: Found {len(set(duplicate_words))} words appearing in multiple word families:")
    for word in sorted(set(duplicate_words)):
        families = word_to_families[word]
        print(f"  '{word}' appears in: {', '.join(families)}")
else:
    print("No words appear in multiple word families")

# Check for overlap between word families and stop words
stop_words_set = set(default_stop_words) | set(additional_stops)
overlap_words = []
overlap_by_family = {}

for family, words in WORD_FAMILIES.items():
    family_overlaps = [word for word in words if word in stop_words_set]
    if family_overlaps:
        overlap_by_family[family] = family_overlaps
        overlap_words.extend(family_overlaps)

if overlap_words:
    print(f"\nWarning: Found {len(set(overlap_words))} words that appear in both word families and stop words:")
    for family, words in sorted(overlap_by_family.items()):
        print(f"  Family '{family}' has {len(words)} stop words: {', '.join(sorted(words))}")
else:
    print("\nNo overlap between word families and stop words")

# Verify no words in word families are in stop words after fixing
stop_words_set = set(default_stop_words)
all_family_words = [word for family_words in WORD_FAMILIES.values() for word in family_words]
remaining_overlaps = [word for word in all_family_words if word in stop_words_set]
print(f"\nAfter fixes, remaining overlaps between word families and stop words: {len(remaining_overlaps)}")
if remaining_overlaps:
    print(f"Remaining overlapping words: {', '.join(sorted(remaining_overlaps))}")

### Global Variables Updates

Sync the updated variables from this notebook into the `vis_tool_core` module.

In [None]:
# Word family mapping
vis_tool_core.WORD_FAMILIES = WORD_FAMILIES  

# Stopword lists
vis_tool_core.additional_stops = additional_stops  
vis_tool_core.default_stop_words = default_stop_words  

# NLP tools
vis_tool_core.TOKENIZER = TOKENIZER  
vis_tool_core.lemmatizer = lemmatizer  
vis_tool_core.MODEL = MODEL
vis_tool_core.MAX_TOKENS = MAX_TOKENS

# Output and cache directories
vis_tool_core.OUTPUT_DIR = OUTPUT_DIR  
vis_tool_core.CLUSTERING_DIR = CLUSTERING_DIR  

### Validators
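
The `seed_words` field in the model below is a plain string. The heatmap pipeline later in this notebook accepts either a comma-separated list of words or semicolon-separated groups of the form `label: word1, word2`. The condensed sketch below mirrors that convention. The helper name `parse_seed_words` is hypothetical, and unlike the pipeline it always splits on semicolons, even when no group label is present.

```python
def parse_seed_words(seed_input):
    """Split a seed string into seed labels and optional groups.

    Returns (seed_words, seed_groups), where seed_groups maps a group label
    to the set of words that should be treated as that label.
    """
    seed_words, seed_groups = [], {}
    if not seed_input or seed_input.strip().lower() == "none":
        # Nothing usable was given; a caller would fall back to frequent words.
        return seed_words, seed_groups

    for part in seed_input.split(";"):
        part = part.strip()
        if ":" in part:
            # "label: w1, w2" declares a group that collapses to its label.
            label, word_str = part.split(":", 1)
            label = label.strip().lower()
            seed_groups[label] = {w.strip().lower() for w in word_str.split(",") if w.strip()}
            seed_words.append(label)
        else:
            # Plain comma-separated words are used individually.
            seed_words.extend(w.strip().lower() for w in part.split(",") if w.strip())
    return seed_words, seed_groups


words, groups = parse_seed_words("school: college, university; teacher, student")
print(words)   # ['school', 'teacher', 'student']
print(groups)  # {'school': {'college', 'university'}} (set order may vary)
```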

In [None]:
# Suppress Pydantic deprecation warnings
warnings.filterwarnings("ignore", category=UserWarning, module="pydantic")

# Temporarily using compatibility mode

class VisualsInput(BaseModel):
    filepath: str
    stop_list: Optional[str] = None
    num_words: int = 10
    clustering_method: int = 1
    distance_metric: str = "default" # "default" | "cosine"
    reuse_clusterings: bool = False
    window_size: int = 5
    min_word_frequency: int = 2
    cross_pos_normalize: bool = False
    projects: Optional[List[str]] = None
    data_groups: Optional[List[str]] = None
    codes: Optional[List[str]] = None
    seed_words: Optional[str] = None

    # ---------- validators ----------
    @field_validator("num_words")
    def validate_num_words(cls, v):
        if v <= 0:
            raise ValueError("num_words must be greater than 0")
        return v

    @field_validator("clustering_method")
    def validate_clustering_method(cls, v):
        if v not in [1, 2, 3, 4]:
            raise ValueError("clustering_method must be 1-4")
        return v

    @field_validator("window_size")
    def validate_window_size(cls, v):
        if v <= 0:
            raise ValueError("window_size must be greater than 0")
        return v

    @field_validator("min_word_frequency")
    def validate_min_word_frequency(cls, v):
        if v <= 0:
            raise ValueError("min_word_frequency must be greater than 0")
        return v

    # ---------- Configuration ----------
    model_config = ConfigDict(extra="allow", validate_assignment=True)


## Helper Functions

### 1. Wordcloud 

This plot shows the most frequent non-trivial words in the selected texts. Larger words occur more often, so you can spot dominant topics at a glance.
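
The word sizes come from a frequency table built after stopword filtering. A minimal sketch of that counting step follows. It uses a regular expression as a stand-in for NLTK's `word_tokenize`, and the sample texts and stopword set are assumptions for illustration.

```python
import re
from collections import Counter

texts = [
    "The teacher praised the student.",
    "Every student thanked the teacher, and the teacher smiled.",
]
stop_words = {"the", "and", "every"}

# Lowercase, keep alphabetic runs, then drop stopwords and very short tokens.
tokens = re.findall(r"[a-z]+", " ".join(texts).lower())
filtered = [t for t in tokens if t not in stop_words and len(t) > 2]

# This table is the kind of input a word cloud sizes its words from.
word_freq = Counter(filtered)
print(word_freq.most_common(2))  # [('teacher', 3), ('student', 2)]
```

A `Counter` like this can be passed directly to `WordCloud.generate_from_frequencies`, which is what the function below does with its own token counts.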



In [None]:
# WordCloud Function


warnings.filterwarnings("ignore")

def make_circular_mask(diam: int = 1600, border: int = 5) -> np.ndarray:
    img = Image.new("L", (diam, diam), 0)
    ImageDraw.Draw(img).ellipse([(border, border), (diam - border, diam - border)], fill=255)
    return 255 - np.array(img)  # WordCloud masks out white (255), so invert: the circle becomes 0 (drawable)


def generate_wordcloud(
    text_series,
    stopwords_path=None,
    title="Wordcloud",
    out_dir=OUTPUT_DIR,
    categories=None
):
    print("\n✔ [OK] Building word-cloud…")

    # Stopwords
    stop_words = set(stopwords.words("english"))
    if stopwords_path and os.path.exists(stopwords_path):
        with open(stopwords_path, 'r') as f:
            stop_words.update(f.read().splitlines())

    # Tokenize and filter
    combined_text = ' '.join(text_series.dropna().astype(str))
    tokens = word_tokenize(combined_text.lower())
    filtered_tokens = [w for w in tokens if w.isalnum() and w not in stop_words and len(w) > 2]
    paragraph_count = len(text_series.dropna())  # count only the rows that contributed text

    # Mask
    mask = make_circular_mask()
    print(f"✔ [OK] Mask ready {mask.shape}")

    # Frequencies
    word_freq = Counter(filtered_tokens)
    print(f"✔ [OK] {len(word_freq):,} unique tokens")

    # Categories
    if categories is None:
        categories = {}
    
    word2cat = {w: cat for cat, info in categories.items() for w in info["words"]}

    # Color function
    def colour_for_word(word, **_):
        cat = word2cat.get(word.lower())
        if cat:
            r, g, b, _ = categories[cat]["color"]
            return f"#{int(r * 255):02x}{int(g * 255):02x}{int(b * 255):02x}"
        return "#bcbcbc"

    # Generate WordCloud
    wc = (WordCloud(
        width=1600, height=1600, mask=mask, background_color="white",
        max_words=600, min_font_size=5, max_font_size=160, font_step=1,
        margin=1, prefer_horizontal=0.3, random_state=42,
        collocations=False, repeat=True, mode="RGBA"
    )
        .generate_from_frequencies(word_freq)
        .recolor(color_func=colour_for_word, random_state=42))

    print("✔ [OK] WordCloud generated")

    # Plot
    fig, ax = plt.subplots(figsize=(12, 12), dpi=300)
    fig.patch.set_facecolor("white")
    ax.imshow(wc.to_array(), interpolation="bilinear")
    ax.axis("off")

    fig.suptitle(title, fontsize=36, fontweight="bold", y=1.05, color="black")
    fig.text(0.5, 0.975, f"Analysis of {paragraph_count:,} Paragraphs of Text",
             ha="center", va="top", fontsize=18, style="italic", color="#333333")

    handles = [Patch(color=info["color"], label=cat) for cat, info in categories.items()]
    legend = ax.legend(handles=handles,
                       loc="lower center", bbox_to_anchor=(0.5, -0.085),
                       ncol=3, frameon=False, fontsize=14)
    for txt in legend.get_texts():
        txt.set_color("#333333")

    plt.tight_layout(rect=[0, 0, 1, 0.93])

    # Save
    os.makedirs(out_dir, exist_ok=True)
    base = "wordcloud_latest"

    fig.savefig(os.path.join(out_dir, f"{base}.png"),
                dpi=300, bbox_inches="tight", format="png")
    print(f"✔ [OK] Saved {base}.png")

    plt.show()


### 2. Word-based Heatmap 

This plot produces a word-by-speaker heatmap with columns clustered by cosine similarity or co-occurrence, so you can quickly see which keywords co-occur across interviews and how they group into thematic clusters.
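
How the toolkit computes co-occurrence is defined in `vis_tool_core` and is not shown in this notebook. As a generic illustration only, the sketch below counts word pairs that fall inside the same sliding window. The function name `cooccurrence_counts` is hypothetical, and its `window_size` argument merely echoes the `window_size` field of `VisualsInput`.

```python
from collections import Counter


def cooccurrence_counts(tokens, window_size=5):
    """Count unordered word pairs that fall inside the same window of `window_size` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair is counted once per occurrence.
        for other in tokens[i + 1 : i + window_size]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = ["teacher", "student", "college", "teacher", "memory"]
pairs = cooccurrence_counts(tokens, window_size=3)
print(pairs[("student", "teacher")])  # 2
print(pairs[("college", "memory")])   # 1
```

Counts of this kind can be arranged into a symmetric word-by-word matrix, which is the sort of input a clustered heatmap expects.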

In [None]:
def run_heatmap_pipeline(input_data):
    """
    Main function to run the heatmap pipeline.
    
    Parameters:
    -----------
    input_data : VisualsInput
        Object containing all input parameters
    """
    # Build reverse mapping for word families
    word_to_base = {}
    for base_word, variants in WORD_FAMILIES.items():
        for variant in variants:
            word_to_base[variant.lower()] = base_word

    seed_groups, seed_words, use_group_label = {}, [], False

    df = pd.read_csv(input_data.filepath)

    # Normalize alternative column names if needed
    if 'text' not in df.columns:
        alternatives = [col for col in df.columns if 'text' in col.lower() or 'content' in col.lower() or 'body' in col.lower()]
        if alternatives:
            print(f"'text' column not found, using '{alternatives[0]}' instead.")
            df.rename(columns={alternatives[0]: 'text'}, inplace=True)
        else:
            print("Error: No suitable text column found.")
            return
    
    # We'll use the stop list instead of hardcoding additional common words
    additional_common_words = set()
            
    # Use the pre-loaded stopwords if available
    if hasattr(input_data, 'custom_stopwords') and input_data.custom_stopwords:
        stop_words = input_data.custom_stopwords
    else:
        # Load stop words the old way if not pre-loaded
        stop_words = manage_stop_list(input_data.stop_list, default_stop_words)
        # Add our additional common words to the stop list
        stop_words = stop_words.union(additional_common_words)

    
    # Fix list columns that may be stored as strings
    for col in ['data_group', 'codes']:
        if col in df.columns:
            # Check if first non-null value is a string that looks like a list
            sample = df[col].dropna().iloc[0] if not df[col].dropna().empty else None
            if isinstance(sample, str) and (sample.startswith('[') or ',' in sample):
                # ast.literal_eval parses list literals without executing arbitrary code
                df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.strip() else 
                                       ([] if pd.isna(x) else [x]))
    
    # Apply Metadata Filters
    if input_data.projects and 'project' in df.columns:
        before = len(df)
        df = df[df['project'].isin(input_data.projects)]
        print(f"Project filter: {before} → {len(df)} rows")
    
    if input_data.data_groups and 'data_group' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list data_group columns
            if df['data_group'].apply(lambda x: isinstance(x, list)).any():
                mask = df['data_group'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.data_groups for item in x))
            else:
                mask = df['data_group'].isin(input_data.data_groups)
            df = df[mask]
            print(f"Data group filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in data_group filtering: {e}")
            print(f"Sample data_group values: {df['data_group'].head()}")
    
    if input_data.codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.codes for item in x))
            else:
                mask = df['codes'].isin(input_data.codes)
            df = df[mask]
            print(f"Codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in codes filtering: {e}")
            print(f"Sample codes values: {df['codes'].head()}")
    
    if len(df) == 0:
        print("Error: All rows were filtered out. Please check your filter criteria.")
        return
    
    sentences = df['text'].dropna().tolist()

    # Add filtered sentences tracking here
    filtered_sentences = []
    for sentence in sentences:
        if not isinstance(sentence, str):
            print(f"Not a string: type={type(sentence)}, value={sentence}")
        tokens = tokenize_and_filter([sentence], stop_list=stop_words, 
                                   lemmatize=True, 
                                   cross_pos_normalize=input_data.cross_pos_normalize)
        if tokens:  # Only keep sentences that have tokens after filtering
            filtered_sentences.append(sentence)
    print(f"After filtering: {len(filtered_sentences)} valid text segments")

    # Filter out excluded codes if specified
    if hasattr(input_data, 'excluded_codes') and input_data.excluded_codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                # Keep rows where NONE of the excluded codes are present
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and not any(item in input_data.excluded_codes for item in x))
            else:
                # Keep rows where the code is not in excluded_codes
                mask = ~df['codes'].isin(input_data.excluded_codes)
            df = df[mask]
            print(f"Excluded codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in excluded_codes filtering: {e}")
    
    # Define a custom word filter function
    def custom_word_filter(word):
        # First normalize with word families
        word_lower = word.lower()
        if word_lower in word_to_base:
            word = word_to_base[word_lower]
        
        # Manual exclusion of common words that should be filtered
        manual_exclusions = {'got', 'get', 'just', 'like', 'many', 'much', 'very', 'really', 'make'}
        
        return (word.lower() not in stop_words and
                word.lower() not in manual_exclusions and
                len(word) > 2 and  # Exclude very short words
                not any(c.isdigit() for c in word) and  # Exclude words with numbers
                re.match(r'^[a-z]+$', word.lower()))  # Only pure alphabetic words
                
    # Check if seed words were provided in the input
    if hasattr(input_data, 'seed_words') and input_data.seed_words and input_data.seed_words.strip().lower() != "none":
        seed_input = input_data.seed_words.strip()

        if ":" in seed_input:
            use_group_label = True
            for part in seed_input.split(";"):
                part = part.strip()
                if ":" in part:
                    group_label, word_str = part.split(":", 1)
                    group_label = group_label.strip().lower()
                    words = [w.strip().lower() for w in word_str.split(",") if w.strip()]
                    seed_groups[group_label] = set(words)
                    seed_words.append(group_label)
                    print(f"Group mode: all {words} will be treated as '{group_label}'")
                else:
                    individuals = [w.strip().lower() for w in part.split(",") if w.strip()]
                    seed_words.extend(individuals)
                    print(f"Individual mode: adding {individuals}")
        else:
            seed_words = [w.strip().lower() for w in seed_input.split(",") if w.strip()]
            print(f"Pure individual word mode: using {seed_words}")
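        if use_group_label:
            # Apply group labels to both text lists so every clustering method sees them
            sentences = [replace_group_words(text, seed_groups) for text in sentences]
            filtered_sentences = [replace_group_words(text, seed_groups) for text in filtered_sentences]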
    else:
        # Process sentences to get word frequencies for auto-selection of top words
        print("WARNING: No seed words provided or 'NONE' specified. Using top frequent words as seeds... ")
        excluded_words = stop_words.union(set(WORD_FAMILIES.keys()))
        all_tokens = []
        for sentence in sentences:
            tokens = tokenize_and_filter([sentence], stop_list=stop_words,
                                        lemmatize=True,
                                        cross_pos_normalize=input_data.cross_pos_normalize)
            filtered_tokens = [token.lower() for token in tokens if custom_word_filter(token)]
            all_tokens.extend(filtered_tokens)
        word_counts = Counter(all_tokens)
        top_words = [word for word, _ in word_counts.most_common(30)
                    if word.lower() not in excluded_words][:min(10, len(word_counts))]
        seed_words = top_words
        print(f"Top frequent words as seeds: {seed_words}")
       
    start = time.time()

    # Clean and normalize seed words (they are already preprocessed with word families)
    print(f"Original seed words before cleaning: {seed_words}")
    clean_seeds = clean_words(seed_words)
    print(f"Seed words after cleaning: {clean_seeds}")
    seed_words = normalize_words(clean_seeds, stop_words, lemmatize=True, cross_pos_normalize=input_data.cross_pos_normalize)
    seed_words = list(set(seed_words))
    print("Final normalized seed words:", seed_words)


    # choose context source: keep stop-words only for RoBERTa
    if input_data.clustering_method == 1:          # 1 = RoBERTa
        sentences_for_embedding = sentences        # full context
    else:                                          # 2-4 = Jaccard/PMI/TF-IDF
        sentences_for_embedding = filtered_sentences

    word_embeddings, similarity_matrix, co_occurrence_matrix = train_embedding(
        sentences_for_embedding,
        context_window = input_data.window_size,
        stop_list  = stop_words, 
        seed_words = seed_words, 
        clustering_method  = input_data.clustering_method,
        num_words = input_data.num_words, 
        lemmatize = True, 
        min_word_frequency = input_data.min_word_frequency,
        reuse_clusterings  = input_data.reuse_clusterings,
        cross_pos_normalize= getattr(input_data, 'cross_pos_normalize', False),
        distance_metric = getattr(input_data, 'distance_metric', 'default'),
        custom_word_filter = custom_word_filter
    )

    elapsed_time = time.time() - start
    print(f"Embedding generation completed in {elapsed_time:.1f} seconds")

    if word_embeddings is None:
        print("Error: Failed to generate embeddings. Please check your input data and parameters.")
        return

    # Plot similarity heatmap (fall back to 'default' if no distance metric is set)
    distance_metric = getattr(input_data, 'distance_metric', 'default')
    print("Plotting similarity heatmap...")
    fig = plot_heatmap(
        input_data.clustering_method, word_embeddings,
        similarity_matrix, co_occurrence_matrix, distance_metric
    )

    filename = f"ONLY_heatmap_{input_data.clustering_method}_{distance_metric}.png"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, filename)

    fig.savefig(out_path, dpi=300, bbox_inches="tight")
    print(f"✔ [OK] Saved {out_path}")

    plt.show()
    
    
    print("\nAnalysis complete")
    print(f"\nNetwork visualization method: {input_data.clustering_method}")
    
    if input_data.clustering_method == 1:
        print("Method: RoBERTa – Shows semantic relationships based on contextual embeddings")
    elif input_data.clustering_method == 2:
        if input_data.distance_metric == "cosine":
            print("Method: Jaccard (cosine) – Uses context window vectors to compute cosine-based similarity between word usage patterns")
        elif input_data.distance_metric == "default":
            print("Method: Jaccard (default) – Uses binary co-occurrence counts within a context window to capture word overlap")
    elif input_data.clustering_method == 3:
        print("Method: PMI – Highlights statistically significant word associations based on pointwise mutual information")
    elif input_data.clustering_method == 4:
        if input_data.distance_metric == "cosine":
            print("Method: TF-IDF (cosine) – Uses TF-IDF-weighted context vectors to compute cosine similarity between words")
        elif input_data.distance_metric == "default":
            print("Method: TF-IDF (default) – Uses raw TF-IDF-weighted co-occurrence scores for word associations")
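
The method labels printed at the end of this pipeline name several ways of scoring word association. As a rough illustration of two of them, the sketch below computes Jaccard overlap and pointwise mutual information (PMI) on a made-up toy corpus. It counts co-occurrence per text segment rather than within a context window, and the helper names `jaccard` and `pmi` are hypothetical, so read it as a simplified illustration and not as how `train_embedding` builds its matrices.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: three already-tokenized segments (made-up data)
segments = [
    ["teacher", "school", "homework"],
    ["teacher", "school"],
    ["homework", "recess"],
]

# Count in how many segments each word, and each unordered word pair, appears
word_counts = Counter(w for seg in segments for w in set(seg))
pair_counts = Counter(frozenset(p) for seg in segments
                      for p in combinations(sorted(set(seg)), 2))
n = len(segments)


def jaccard(a, b):
    """Segments containing both words / segments containing either word."""
    both = pair_counts[frozenset((a, b))]
    return both / (word_counts[a] + word_counts[b] - both)


def pmi(a, b):
    """log2 of the observed co-occurrence rate over the rate expected by chance."""
    both = pair_counts[frozenset((a, b))]
    if both == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((both / n) / ((word_counts[a] / n) * (word_counts[b] / n)))


print(round(jaccard("teacher", "school"), 2))  # 1.0
print(round(pmi("teacher", "school"), 2))      # 0.58
```

Jaccard only asks how often two words share a segment, while PMI also corrects for how common each word is on its own, which is why the two methods can rank the same pair differently.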


### 3. t-SNE

This function creates a 2-D t-SNE map of your vocabulary from either the similarity matrix (embeddings) or the co-occurrence matrix, so that spatial distance approximates lexical or semantic closeness. Seed words are drawn larger and in red to show how neighbouring terms cluster around them.
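
Seed words can be given as a plain comma-separated list or as labelled groups separated by semicolons (`label: word1, word2; other, words`). The sketch below mirrors the parsing done inside the pipeline; `parse_seed_words` is a hypothetical helper name used only for illustration and is not defined in the notebook, and the example words are made up.

```python
def parse_seed_words(seed_input):
    """Split a seed string into (seed_words, seed_groups).

    Hypothetical helper mirroring the parsing inside the pipeline.
    """
    seed_input = seed_input.strip()
    seed_words, seed_groups = [], {}

    # No colon anywhere: treat the whole string as individual words
    if ":" not in seed_input:
        return [w.strip().lower() for w in seed_input.split(",") if w.strip()], seed_groups

    for part in seed_input.split(";"):
        part = part.strip()
        if ":" in part:
            # "label: a, b" -> every listed variant is treated as the label
            label, word_str = part.split(":", 1)
            label = label.strip().lower()
            seed_groups[label] = {w.strip().lower() for w in word_str.split(",") if w.strip()}
            seed_words.append(label)
        else:
            # plain "a, b" inside a grouped string -> each word is its own seed
            seed_words.extend(w.strip().lower() for w in part.split(",") if w.strip())
    return seed_words, seed_groups


words, groups = parse_seed_words("family: mother, father; school, teacher")
print(words)                                     # ['family', 'school', 'teacher']
print({k: sorted(v) for k, v in groups.items()})  # {'family': ['father', 'mother']}
```

In group mode the label itself becomes the seed, so on the map "mother" and "father" would appear as a single "family" point.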

In [None]:
def run_tsne_pipeline(input_data):
    """
    Run the t-SNE pipeline: load and filter the data, build word embeddings,
    and plot a 2-D t-SNE map with the seed words highlighted.
    
    Parameters:
    -----------
    input_data : VisualsInput
        Object containing all input parameters
    """
    # Build reverse mapping for word families
    word_to_base = {}
    for base_word, variants in WORD_FAMILIES.items():
        for variant in variants:
            word_to_base[variant.lower()] = base_word

    seed_groups, seed_words, use_group_label = {}, [], False
    auto_selected_seeds = False  # Flag to track if seeds were auto-selected

    df = pd.read_csv(input_data.filepath)
    
    # Normalize alternative column names if needed
    if 'text' not in df.columns:
        alternatives = [col for col in df.columns if 'text' in col.lower() or 'content' in col.lower() or 'body' in col.lower()]
        if alternatives:
            print(f"'text' column not found, using '{alternatives[0]}' instead.")
            df.rename(columns={alternatives[0]: 'text'}, inplace=True)
        else:
            print("Error: No suitable text column found.")
            return
    
    # We'll use the stop list instead of hardcoding additional common words
    additional_common_words = set()
            
    # Use the pre-loaded stopwords if available
    if hasattr(input_data, 'custom_stopwords') and input_data.custom_stopwords:
        stop_words = input_data.custom_stopwords
    else:
        # Load stop words the old way if not pre-loaded
        stop_words = manage_stop_list(input_data.stop_list, default_stop_words)
        # Add our additional common words to the stop list
        stop_words = stop_words.union(additional_common_words)

    
    # Fix list columns that may be stored as strings
    import ast  # local import: parses list literals without executing arbitrary code
    for col in ['data_group', 'codes']:
        if col in df.columns:
            # Check if first non-null value is a string that looks like a list
            sample = df[col].dropna().iloc[0] if not df[col].dropna().empty else None
            if isinstance(sample, str) and (sample.startswith('[') or ',' in sample):
                df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.strip() else
                                       ([] if pd.isna(x) else [x]))
    
    # Apply Metadata Filters
    if input_data.projects and 'project' in df.columns:
        before = len(df)
        df = df[df['project'].isin(input_data.projects)]
        print(f"Project filter: {before} → {len(df)} rows")
    
    if input_data.data_groups and 'data_group' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list data_group columns
            if df['data_group'].apply(lambda x: isinstance(x, list)).any():
                mask = df['data_group'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.data_groups for item in x))
            else:
                mask = df['data_group'].isin(input_data.data_groups)
            df = df[mask]
            print(f"Data group filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in data_group filtering: {e}")
            print(f"Sample data_group values: {df['data_group'].head()}")
    
    if input_data.codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.codes for item in x))
            else:
                mask = df['codes'].isin(input_data.codes)
            df = df[mask]
            print(f"Codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in codes filtering: {e}")
            print(f"Sample codes values: {df['codes'].head()}")
    
    if len(df) == 0:
        print("Error: All rows were filtered out. Please check your filter criteria.")
        return
    
    sentences = df['text'].dropna().tolist()
    print(f"Final dataset: {len(sentences)} text segments ready for processing")

    # Add filtered sentences tracking here
    filtered_sentences = []
    for sentence in sentences:
        if not isinstance(sentence, str):
            print(f"Not a string: type={type(sentence)}, value={sentence}")
        tokens = tokenize_and_filter([sentence], stop_list=stop_words, 
                                   lemmatize=True, 
                                   cross_pos_normalize=input_data.cross_pos_normalize)
        if tokens:  # Only keep sentences that have tokens after filtering
            filtered_sentences.append(sentence)
    print(f"After filtering: {len(filtered_sentences)} valid text segments")
    

    # Filter out excluded codes if specified
    if hasattr(input_data, 'excluded_codes') and input_data.excluded_codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                # Keep rows where NONE of the excluded codes are present
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and not any(item in input_data.excluded_codes for item in x))
            else:
                # Keep rows where the code is not in excluded_codes
                mask = ~df['codes'].isin(input_data.excluded_codes)
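            df = df[mask]
            print(f"Excluded codes filter: {before} → {len(df)} rows")
            # Rebuild the text lists so excluded rows no longer reach the embedding step
            sentences = df['text'].dropna().tolist()
            kept = set(sentences)
            filtered_sentences = [s for s in filtered_sentences if s in kept]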
        except Exception as e:
            print(f"Error in excluded_codes filtering: {e}")
    
    # Define a custom word filter function
    def custom_word_filter(word):
        # First normalize with word families
        word_lower = word.lower()
        if word_lower in word_to_base:
            word = word_to_base[word_lower]
        
        # Manual exclusion of common words that should be filtered
        manual_exclusions = {'got', 'get', 'just', 'like', 'many', 'much', 'very', 'really', 'make'}
        
        return (word.lower() not in stop_words and
                word.lower() not in manual_exclusions and
                len(word) > 2 and  # Exclude very short words
                not any(c.isdigit() for c in word) and  # Exclude words with numbers
                re.match(r'^[a-z]+$', word.lower()))  # Only pure alphabetic words
                
    # Check if seed words were provided in the input
    if hasattr(input_data, 'seed_words') and input_data.seed_words and input_data.seed_words.strip().lower() != "none":
        seed_input = input_data.seed_words.strip()

        if ":" in seed_input:
            use_group_label = True
            for part in seed_input.split(";"):
                part = part.strip()
                if ":" in part:
                    group_label, word_str = part.split(":", 1)
                    group_label = group_label.strip().lower()
                    words = [w.strip().lower() for w in word_str.split(",") if w.strip()]
                    seed_groups[group_label] = set(words)
                    seed_words.append(group_label)
                    print(f"Group mode: all {words} will be treated as '{group_label}'")
                else:
                    individuals = [w.strip().lower() for w in part.split(",") if w.strip()]
                    seed_words.extend(individuals)
                    print(f"Individual mode: adding {individuals}")
        else:
            seed_words = [w.strip().lower() for w in seed_input.split(",") if w.strip()]
            print(f"Pure individual word mode: using {seed_words}")
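        if use_group_label:
            # Apply group labels to both text lists so every clustering method sees them
            sentences = [replace_group_words(text, seed_groups) for text in sentences]
            filtered_sentences = [replace_group_words(text, seed_groups) for text in filtered_sentences]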
    else:
        # Process sentences to get word frequencies for auto-selection of top words
        print("WARNING: No seed words provided or 'NONE' specified. Using top frequent words as seeds... ")
        excluded_words = stop_words.union(set(WORD_FAMILIES.keys()))
        all_tokens = []
        for sentence in sentences:
            tokens = tokenize_and_filter([sentence], stop_list=stop_words,
                                        lemmatize=True,
                                        cross_pos_normalize=input_data.cross_pos_normalize)
            filtered_tokens = [token.lower() for token in tokens if custom_word_filter(token)]
            all_tokens.extend(filtered_tokens)
        word_counts = Counter(all_tokens)
        top_words = [word for word, _ in word_counts.most_common(30)
                    if word.lower() not in excluded_words][:min(10, len(word_counts))]
        seed_words = top_words
        print(f"Top frequent words as seeds: {seed_words}")

    start = time.time()

    # Clean and normalize seed words (they are already preprocessed with word families)
    print(f"Original seed words before cleaning: {seed_words}")
    clean_seeds = clean_words(seed_words)
    print(f"Seed words after cleaning: {clean_seeds}")
    seed_words = normalize_words(clean_seeds, stop_words, lemmatize=True, cross_pos_normalize=input_data.cross_pos_normalize)
    seed_words = list(set(seed_words))
    print("Final normalized seed words:", seed_words)

    # choose context source: keep stop-words only for RoBERTa
    if input_data.clustering_method == 1:          # 1 = RoBERTa
        sentences_for_embedding = sentences        # full context
    else:                                          # 2-4 = Jaccard/PMI/TF-IDF
        sentences_for_embedding = filtered_sentences

    word_embeddings, similarity_matrix, co_occurrence_matrix = train_embedding(
        sentences_for_embedding,
        context_window = input_data.window_size,
        stop_list  = stop_words, 
        seed_words = seed_words, 
        clustering_method  = input_data.clustering_method,
        num_words = input_data.num_words, 
        lemmatize = True, 
        min_word_frequency = input_data.min_word_frequency,
        reuse_clusterings  = input_data.reuse_clusterings,
        cross_pos_normalize= getattr(input_data, 'cross_pos_normalize', False),
        distance_metric = getattr(input_data, 'distance_metric', 'default'),
        custom_word_filter = custom_word_filter
    )

    elapsed_time = time.time() - start
    print(f"Embedding generation completed in {elapsed_time:.1f} seconds")

    if word_embeddings is None:
        print("Error: Failed to generate embeddings. Please check your input data and parameters.")
        return

    # Generate t-SNE Dimensional Reduction Plot
    print("-----------------------------------------------")
    print("Generating t-SNE dimensional reduction plot...")
    try:
        plot_tsne_dimensional_reduction(
            word_embeddings=word_embeddings,
            similarity_matrix=similarity_matrix,
            co_occurrence_matrix=co_occurrence_matrix,
            clustering_method=input_data.clustering_method,
            seed_words=seed_words, 
            distance_metric=getattr(input_data, 'distance_metric', 'default')
        )
    except Exception as e:
        print(f"t-SNE plot error: {e}")
    
    print("\nAnalysis complete")
    print(f"\nNetwork visualization method: {input_data.clustering_method}")
    
    if input_data.clustering_method == 1:
        print("Method: RoBERTa – Shows semantic relationships based on contextual embeddings")
    elif input_data.clustering_method == 2:
        if input_data.distance_metric == "cosine":
            print("Method: Jaccard (cosine) – Uses context window vectors to compute cosine-based similarity between word usage patterns")
        elif input_data.distance_metric == "default":
            print("Method: Jaccard (default) – Uses binary co-occurrence counts within a context window to capture word overlap")
    elif input_data.clustering_method == 3:
        print("Method: PMI – Highlights statistically significant word associations based on pointwise mutual information")
    elif input_data.clustering_method == 4:
        if input_data.distance_metric == "cosine":
            print("Method: TF-IDF (cosine) – Uses TF-IDF-weighted context vectors to compute cosine similarity between words")
        elif input_data.distance_metric == "default":
            print("Method: TF-IDF (default) – Uses raw TF-IDF-weighted co-occurrence scores for word associations")

    

### 4. Word-based Heatmap and Semantic Network (no custom colors, no edge bundling)

This function generates a word-similarity heatmap plus a plain semantic-network graph with default blue nodes, gray edges, no category colours, and no edge bundling, giving a quick, unstyled view of word overlap and overall connectivity. It then draws a second network with the seed words hidden, provided more than five non-seed words remain.
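
After the plain network, the function also draws a second network with the seed words removed. The sketch below shows that filtering step on a tiny made-up matrix, using plain Python lists instead of the NumPy arrays the notebook works with; the words and scores are invented for illustration.

```python
# Toy data: made-up words and similarity scores, for illustration only
words = ["school", "teacher", "homework", "recess"]
similarity = [
    [1.0, 0.8, 0.6, 0.3],
    [0.8, 1.0, 0.5, 0.2],
    [0.6, 0.5, 1.0, 0.4],
    [0.3, 0.2, 0.4, 1.0],
]
seed_words = ["school"]

# Keep the positions of every word that is not a seed
keep = [i for i, word in enumerate(words) if word not in seed_words]
non_seed_words = [words[i] for i in keep]

# Select the same rows and columns, as np.ix_(keep, keep) does for arrays
filtered_similarity = [[similarity[r][c] for c in keep] for r in keep]

print(non_seed_words)       # ['teacher', 'homework', 'recess']
print(filtered_similarity)  # [[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]]
```

Dropping the seed row and column together keeps the matrix square and aligned with the remaining word list, which the network plot relies on.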

In [None]:
# Function to Run the Pipeline
def run_heatmap_network_plain_pipeline(input_data):
    """
    Main function to run the semantic network analysis pipeline.
    
    Parameters:
    -----------
    input_data : VisualsInput
        Object containing all input parameters
    """
    # Build reverse mapping for word families
    word_to_base = {}
    for base_word, variants in WORD_FAMILIES.items():
        for variant in variants:
            word_to_base[variant.lower()] = base_word

    seed_groups, seed_words, use_group_label = {}, [], False
    auto_selected_seeds = False  # Flag to track if seeds were auto-selected

    df = pd.read_csv(input_data.filepath)
  
    # Normalize alternative column names if needed
    if 'text' not in df.columns:
        alternatives = [col for col in df.columns if 'text' in col.lower() or 'content' in col.lower() or 'body' in col.lower()]
        if alternatives:
            print(f"'text' column not found, using '{alternatives[0]}' instead.")
            df.rename(columns={alternatives[0]: 'text'}, inplace=True)
        else:
            print("Error: No suitable text column found.")
            return
    
    # We'll use the stop list instead of hardcoding additional common words
    additional_common_words = set()
            
    # Use the pre-loaded stopwords if available
    if hasattr(input_data, 'custom_stopwords') and input_data.custom_stopwords:
        stop_words = input_data.custom_stopwords
    else:
        # Load stop words the old way if not pre-loaded
        stop_words = manage_stop_list(input_data.stop_list, default_stop_words)
        # Add our additional common words to the stop list
        stop_words = stop_words.union(additional_common_words)

    
    # Fix list columns that may be stored as strings
    import ast  # local import: parses list literals without executing arbitrary code
    for col in ['data_group', 'codes']:
        if col in df.columns:
            # Check if first non-null value is a string that looks like a list
            sample = df[col].dropna().iloc[0] if not df[col].dropna().empty else None
            if isinstance(sample, str) and (sample.startswith('[') or ',' in sample):
                df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.strip() else
                                       ([] if pd.isna(x) else [x]))
    
    # Apply Metadata Filters
    if input_data.projects and 'project' in df.columns:
        before = len(df)
        df = df[df['project'].isin(input_data.projects)]
        print(f"Project filter: {before} → {len(df)} rows")
    
    if input_data.data_groups and 'data_group' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list data_group columns
            if df['data_group'].apply(lambda x: isinstance(x, list)).any():
                mask = df['data_group'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.data_groups for item in x))
            else:
                mask = df['data_group'].isin(input_data.data_groups)
            df = df[mask]
            print(f"Data group filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in data_group filtering: {e}")
            print(f"Sample data_group values: {df['data_group'].head()}")
    
    if input_data.codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.codes for item in x))
            else:
                mask = df['codes'].isin(input_data.codes)
            df = df[mask]
            print(f"Codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in codes filtering: {e}")
            print(f"Sample codes values: {df['codes'].head()}")
    
    if len(df) == 0:
        print("Error: All rows were filtered out. Please check your filter criteria.")
        return
    
    sentences = df['text'].dropna().tolist()
    print(f"Final dataset: {len(sentences)} text segments ready for processing")

    # Add filtered sentences tracking here
    filtered_sentences = []
    for sentence in sentences:
        if not isinstance(sentence, str):
            print(f"Not a string: type={type(sentence)}, value={sentence}")
        tokens = tokenize_and_filter([sentence],stop_list=stop_words, 
                                   lemmatize=True, 
                                   cross_pos_normalize=input_data.cross_pos_normalize)
        if tokens:  # Only keep sentences that have tokens after filtering
            filtered_sentences.append(sentence)
    print(f"After filtering: {len(filtered_sentences)} valid text segments")
    

    # Filter out excluded codes if specified
    if hasattr(input_data, 'excluded_codes') and input_data.excluded_codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                # Keep rows where NONE of the excluded codes are present
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and not any(item in input_data.excluded_codes for item in x))
            else:
                # Keep rows where the code is not in excluded_codes
                mask = ~df['codes'].isin(input_data.excluded_codes)
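            df = df[mask]
            print(f"Excluded codes filter: {before} → {len(df)} rows")
            # Rebuild the text lists so excluded rows no longer reach the embedding step
            sentences = df['text'].dropna().tolist()
            kept = set(sentences)
            filtered_sentences = [s for s in filtered_sentences if s in kept]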
        except Exception as e:
            print(f"Error in excluded_codes filtering: {e}")
    
    # Define a custom word filter function
    def custom_word_filter(word):
        # First normalize with word families
        word_lower = word.lower()
        if word_lower in word_to_base:
            word = word_to_base[word_lower]
        
        # Manual exclusion of common words that should be filtered
        manual_exclusions = {'got', 'get', 'just', 'like', 'many', 'much', 'very', 'really', 'make'}
        
        return (word.lower() not in stop_words and
                word.lower() not in manual_exclusions and
                len(word) > 2 and  # Exclude very short words
                not any(c.isdigit() for c in word) and  # Exclude words with numbers
                re.match(r'^[a-z]+$', word.lower()))  # Only pure alphabetic words
                
    # Check if seed words were provided in the input
    if hasattr(input_data, 'seed_words') and input_data.seed_words and input_data.seed_words.strip().lower() != "none":
        seed_input = input_data.seed_words.strip()

        if ":" in seed_input:
            use_group_label = True
            for part in seed_input.split(";"):
                part = part.strip()
                if ":" in part:
                    group_label, word_str = part.split(":", 1)
                    group_label = group_label.strip().lower()
                    words = [w.strip().lower() for w in word_str.split(",") if w.strip()]
                    seed_groups[group_label] = set(words)
                    seed_words.append(group_label)
                    print(f"Group mode: all {words} will be treated as '{group_label}'")
                else:
                    individuals = [w.strip().lower() for w in part.split(",") if w.strip()]
                    seed_words.extend(individuals)
                    print(f"Individual mode: adding {individuals}")
        else:
            seed_words = [w.strip().lower() for w in seed_input.split(",") if w.strip()]
            print(f"Pure individual word mode: using {seed_words}")
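        if use_group_label:
            # Apply group labels to both text lists so every clustering method sees them
            sentences = [replace_group_words(text, seed_groups) for text in sentences]
            filtered_sentences = [replace_group_words(text, seed_groups) for text in filtered_sentences]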
    else:
        # Process sentences to get word frequencies for auto-selection of top words
        print("WARNING: No seed words provided or 'NONE' specified. Using top frequent words as seeds... ")
        excluded_words = stop_words.union(set(WORD_FAMILIES.keys()))
        all_tokens = []
        for sentence in sentences:
            tokens = tokenize_and_filter([sentence], stop_list=stop_words,
                                        lemmatize=True,
                                        cross_pos_normalize=input_data.cross_pos_normalize)
            filtered_tokens = [token.lower() for token in tokens if custom_word_filter(token)]
            all_tokens.extend(filtered_tokens)
        word_counts = Counter(all_tokens)
        top_words = [word for word, _ in word_counts.most_common(30)
                    if word.lower() not in excluded_words][:min(10, len(word_counts))]
        seed_words = top_words
        auto_selected_seeds = True  # seeds were chosen automatically, not by the user
        print(f"Top frequent words as seeds: {seed_words}")
    
    start = time.time()

    # Clean and normalize seed words (they are already preprocessed with word families)
    print(f"Original seed words before cleaning: {seed_words}")
    clean_seeds = clean_words(seed_words)
    print(f"Seed words after cleaning: {clean_seeds}")
    seed_words = normalize_words(clean_seeds, stop_words, lemmatize=True, cross_pos_normalize=input_data.cross_pos_normalize)
    seed_words = list(set(seed_words))
    print("Final normalized seed words:", seed_words)


    # choose context source: keep stop-words only for RoBERTa
    if input_data.clustering_method == 1:          # 1 = RoBERTa
        sentences_for_embedding = sentences        # full context
    else:                                          # 2-4 = Jaccard/PMI/TF-IDF
        sentences_for_embedding = filtered_sentences

    word_embeddings, similarity_matrix, co_occurrence_matrix = train_embedding(
        sentences_for_embedding,
        context_window = input_data.window_size, 
        stop_list  = stop_words, 
        seed_words = seed_words, 
        clustering_method  = input_data.clustering_method,
        num_words = input_data.num_words, 
        lemmatize = True, 
        min_word_frequency = input_data.min_word_frequency,
        reuse_clusterings  = input_data.reuse_clusterings,
        cross_pos_normalize= getattr(input_data, 'cross_pos_normalize', False),
        distance_metric = getattr(input_data, 'distance_metric', 'default'),
        custom_word_filter = custom_word_filter
    )

    elapsed_time = time.time() - start
    print(f"Embedding generation completed in {elapsed_time:.1f} seconds")

    if word_embeddings is None:
        print("Error: Failed to generate embeddings. Please check your input data and parameters.")
        return

    # Plot similarity heatmap (fall back to 'default' if no distance metric is set)
    distance_metric = getattr(input_data, 'distance_metric', 'default')
    print("Plotting similarity heatmap...")
    fig_heat = plot_heatmap(
        input_data.clustering_method, word_embeddings,
        similarity_matrix, co_occurrence_matrix, distance_metric
    )

    filename = f"heatmap_plain_m{input_data.clustering_method}_{distance_metric}.png"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, filename)

    fig_heat.savefig(out_path, dpi=300, bbox_inches="tight")
    print(f"✔ [OK] Saved {out_path}")

    plt.show()

    print("-----------------------------------------------")
    print("Plotting semantic network (plain, no categories)...")
    fig_sn1 = plot_semantic_network(
        word_embeddings,
        [] if auto_selected_seeds else seed_words,
        input_data.clustering_method,
        similarity_matrix, co_occurrence_matrix,
        semantic_categories=None,               
        link_threshold = input_data.link_threshold,
        link_color_threshold= input_data.link_color_threshold,
        distance_metric=getattr(input_data, 'distance_metric', 'default'))
    filename = f"semantic_network_plain_m{input_data.clustering_method}_{input_data.distance_metric}.png"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, filename)

    fig_sn1.suptitle("Semantic Network (Plain)", fontsize=30, y=0.98, fontweight='bold')
    fig_sn1.subplots_adjust(top=0.95)

    fig_sn1.savefig(out_path, dpi=300, bbox_inches="tight")
    print(f"✔ [OK] Saved {out_path}")

    plt.show()

    # Add a second network visualization that removes seed nodes
    print("\nGenerating secondary network with seed nodes hidden...")
    
    # Filter out seed words from word embeddings and matrices
    non_seed_words = [word for word in word_embeddings.keys() if word not in seed_words]
    
    if len(non_seed_words) > 5:  # Only proceed if we have enough nodes to make a meaningful network
        non_seed_embeddings = {word: word_embeddings[word] for word in non_seed_words}
        
        # Row/column positions of the non-seed words in the original matrices
        position = {word: i for i, word in enumerate(word_embeddings.keys())}
        indices = [position[word] for word in non_seed_words]

        # Create filtered similarity/co-occurrence matrices
        filtered_similarity = (similarity_matrix[np.ix_(indices, indices)]
                               if similarity_matrix is not None else None)
        filtered_co_occurrence = (co_occurrence_matrix[np.ix_(indices, indices)]
                                  if co_occurrence_matrix is not None else None)
        
        # Plot the filtered network
        print(f"Plotting secondary network with {len(non_seed_words)} nodes (seeds hidden)...")
        fig_sn2 = plot_semantic_network(
            non_seed_embeddings, [], 
            input_data.clustering_method, 
            filtered_similarity, filtered_co_occurrence, 
            semantic_categories=None,
            link_threshold=input_data.link_threshold,
            link_color_threshold=input_data.link_color_threshold,
            distance_metric=getattr(input_data, 'distance_metric', 'default')
        )
        filename = f"semantic_network_noseeds_m{input_data.clustering_method}_{input_data.distance_metric}.png"
        out_path = os.path.join(OUTPUT_DIR, filename)

        fig_sn2.suptitle("Semantic Network (Seeds Hidden)", fontsize=30, y=0.98, fontweight='bold')
        fig_sn2.subplots_adjust(top=0.95)

        fig_sn2.savefig(out_path, dpi=300, bbox_inches="tight")
        print(f"✔ [OK] Saved {out_path}")

        plt.show()
        
   
    else:
        print("Not enough non-seed nodes to generate a meaningful secondary network.")
    
    # Print the list of actual nodes used in the network (excluding seeds)
    print("\nFinal nodes used in network (excluding seeds):")
    if 'non_seed_words' in locals() and len(non_seed_words) > 0:
        # Get frequencies for each node and sort by frequency (highest first)
        print(f"Total non-seed words: {len(non_seed_words)}")
        
        # Use word_frequencies if it exists, otherwise fall back to word_counts from seed auto-selection
        freq_lookup = locals().get('word_frequencies', locals().get('word_counts'))
        if freq_lookup is None:
            print("Warning: Word frequency information not available")
            # Just print the words without frequencies
            for word in sorted(non_seed_words):
                print(f"- {word}")
        else:
            # Create list of (word, frequency) pairs and sort by frequency
            freq_sorted_words = [(word, freq_lookup[word]) for word in non_seed_words if word in freq_lookup]
            freq_sorted_words.sort(key=lambda x: x[1], reverse=True)
            
            for word, freq in freq_sorted_words:
                print(f"- {word}: {freq}")
    else:
        print("No non-seed nodes were used in the network.")
    
    # Print the number of filtered sentences used
    if 'filtered_sentences' in locals():
        print(f"\nTotal number of filtered sentences used: {len(filtered_sentences)}")
        if 'seed_words' in locals() and seed_words:
            # Enhanced seed word detection using word families
            seed_containing_sentences = 0
            for sentence in filtered_sentences:
                sentence_lower = sentence.lower()
                contains_seed = False
                for seed in seed_words:
                    # Check direct match
                    if seed.lower() in sentence_lower:
                        contains_seed = True
                        break
                    # Check word family variants
                    for family, variants in WORD_FAMILIES.items():
                        if seed.lower() == family.lower() or seed.lower() in [v.lower() for v in variants]:
                            if any(variant.lower() in sentence_lower for variant in variants):
                                contains_seed = True
                                break
                    if contains_seed:
                        break
                if contains_seed:
                    seed_containing_sentences += 1
            print(f"Number of filtered sentences containing seed words: {seed_containing_sentences}")
    
    print("\nAnalysis complete")
    print(f"\nNetwork visualization method: {input_data.clustering_method}")
    
    if input_data.clustering_method == 1:
        print("Method: RoBERTa – Shows semantic relationships based on contextual embeddings")
    elif input_data.clustering_method == 2:
        if input_data.distance_metric == "cosine":
            print("Method: Jaccard (cosine) – Uses context window vectors to compute cosine-based similarity between word usage patterns")
        elif input_data.distance_metric == "default":
            print("Method: Jaccard (default) – Uses binary co-occurrence counts within a context window to capture word overlap")
    elif input_data.clustering_method == 3:
        print("Method: PMI – Highlights statistically significant word associations based on pointwise mutual information")
    elif input_data.clustering_method == 4:
        if input_data.distance_metric == "cosine":
            print("Method: TF-IDF (cosine) – Uses TF-IDF-weighted context vectors to compute cosine similarity between words")
        elif input_data.distance_metric == "default":
            print("Method: TF-IDF (default) – Uses raw TF-IDF-weighted co-occurrence scores for word associations")


### 5. Word-based Heatmap and Semantic Network (custom colors, no edge bundling)

This function generates a word-based similarity heatmap together with a semantic network graph. The network uses your custom color palette for node groups, so themes stand out, but keeps simple straight gray edges with no bundling. The result is a quick colored overview of concept clusters without extra visual wiring.
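
The pipeline in this section reads its seed words from a single string, either as plain comma-separated words or as `label: word, word` groups separated by semicolons. The sketch below mirrors those parsing rules in isolation so the accepted formats are easier to see. `parse_seed_string` is a hypothetical helper written for illustration and is not part of the notebook.

```python
def parse_seed_string(seed_input):
    """Split a seed string into (seed_words, seed_groups).

    Illustrative helper (an assumption, not a notebook function): it follows
    the same rules the pipeline applies to `input_data.seed_words`.
    """
    seed_words, seed_groups = [], {}
    seed_input = seed_input.strip()

    if ":" not in seed_input:
        # Pure individual mode, e.g. "trust, risk"
        words = [w.strip().lower() for w in seed_input.split(",") if w.strip()]
        return words, seed_groups

    for part in seed_input.split(";"):
        part = part.strip()
        if ":" in part:
            # Group mode: every listed word is treated as the group label
            label, word_str = part.split(":", 1)
            label = label.strip().lower()
            seed_groups[label] = {w.strip().lower() for w in word_str.split(",") if w.strip()}
            seed_words.append(label)
        else:
            # Individual words mixed in between groups
            seed_words.extend(w.strip().lower() for w in part.split(",") if w.strip())
    return seed_words, seed_groups


words, groups = parse_seed_string("Money: cash, funds, budget; trust, Risk")
print(words)                    # ['money', 'trust', 'risk']
print(sorted(groups["money"]))  # ['budget', 'cash', 'funds']
```

In group mode only the label becomes a seed; the grouped variants are replaced by that label in the text before embeddings are built.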

In [None]:

# Function to Run the Pipeline
def run_heatmap_network_pipeline(input_data):
    """
    Main function to run the semantic network analysis pipeline.
    
    Parameters:
    -----------
    input_data : SemanticNetworkInput
        Object containing all input parameters
    """
    # Build reverse mapping for word families
    word_to_base = {}
    for base_word, variants in WORD_FAMILIES.items():
        for variant in variants:
            word_to_base[variant.lower()] = base_word

    seed_groups, seed_words, use_group_label = {}, [], False
    auto_selected_seeds = False  # Flag to track if seeds were auto-selected

    df = pd.read_csv(input_data.filepath)
    
    original_row_count = len(df)
    
    # Normalize alternative column names if needed
    if 'text' not in df.columns:
        alternatives = [col for col in df.columns if 'text' in col.lower() or 'content' in col.lower() or 'body' in col.lower()]
        if alternatives:
            print(f"'text' column not found, using '{alternatives[0]}' instead.")
            df.rename(columns={alternatives[0]: 'text'}, inplace=True)
        else:
            print("Error: No suitable text column found.")
            return
    
    # We'll use the stop list instead of hardcoding additional common words
    additional_common_words = set()
            
    # Use the pre-loaded stopwords if available
    if hasattr(input_data, 'custom_stopwords') and input_data.custom_stopwords:
        stop_words = input_data.custom_stopwords
    else:
        # Load stop words the old way if not pre-loaded
        stop_words = manage_stop_list(input_data.stop_list, default_stop_words)
        # Add our additional common words to the stop list
        stop_words = stop_words.union(additional_common_words)
    
    # Fix list columns that may be stored as strings
    for col in ['data_group', 'codes']:
        if col in df.columns:
            # Check if first non-null value is a string that looks like a list
            sample = df[col].dropna().iloc[0] if not df[col].dropna().empty else None
            if isinstance(sample, str) and (sample.startswith('[') or ',' in sample):
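                # ast.literal_eval parses Python literals only, unlike eval, which runs arbitrary code (requires `import ast`)
                df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.strip() else 
                                       ([] if pd.isna(x) else [x]))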
    
    # Apply Metadata Filters
    if input_data.projects and 'project' in df.columns:
        before = len(df)
        df = df[df['project'].isin(input_data.projects)]
        print(f"Project filter: {before} → {len(df)} rows")
    
    if input_data.data_groups and 'data_group' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list data_group columns
            if df['data_group'].apply(lambda x: isinstance(x, list)).any():
                mask = df['data_group'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.data_groups for item in x))
            else:
                mask = df['data_group'].isin(input_data.data_groups)
            df = df[mask]
            print(f"Data group filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in data_group filtering: {e}")
            print(f"Sample data_group values: {df['data_group'].head()}")
    
    if input_data.codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and any(item in input_data.codes for item in x))
            else:
                mask = df['codes'].isin(input_data.codes)
            df = df[mask]
            print(f"Codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in codes filtering: {e}")
            print(f"Sample codes values: {df['codes'].head()}")
    
    if len(df) == 0:
        print("Error: All rows were filtered out. Please check your filter criteria.")
        return
    
    sentences = df['text'].dropna().tolist()
    print(f"Final dataset: {len(sentences)} text segments ready for processing")

    # Add filtered sentences tracking here
    filtered_sentences = []
    for sentence in sentences:
        if not isinstance(sentence, str):
            print(f"Skipping non-string entry: type={type(sentence)}, value={sentence}")
            continue
        tokens = tokenize_and_filter([sentence],stop_list=stop_words, 
                                   lemmatize=True, 
                                   cross_pos_normalize=input_data.cross_pos_normalize)
        if tokens:  # Only keep sentences that have tokens after filtering
            filtered_sentences.append(sentence)
    print(f"After filtering: {len(filtered_sentences)} valid text segments")
    

    # Filter out excluded codes if specified.
    # Note: `sentences` was built above, so this filter changes `df` but not the text analyzed below.
    if hasattr(input_data, 'excluded_codes') and input_data.excluded_codes and 'codes' in df.columns:
        before = len(df)
        try:
            # Handle both list and non-list codes columns
            if df['codes'].apply(lambda x: isinstance(x, list)).any():
                # Keep rows where NONE of the excluded codes are present
                mask = df['codes'].apply(lambda x: 
                    isinstance(x, list) and not any(item in input_data.excluded_codes for item in x))
            else:
                # Keep rows where the code is not in excluded_codes
                mask = ~df['codes'].isin(input_data.excluded_codes)
            df = df[mask]
            print(f"Excluded codes filter: {before} → {len(df)} rows")
        except Exception as e:
            print(f"Error in excluded_codes filtering: {e}")
    
    # Define a custom word filter function
    def custom_word_filter(word):
        # First normalize with word families
        word_lower = word.lower()
        if word_lower in word_to_base:
            word = word_to_base[word_lower]
        
        # Manual exclusion of common words that should be filtered
        manual_exclusions = {'got', 'get', 'just', 'like', 'many', 'much', 'very', 'really', 'make'}
        
        return (word.lower() not in stop_words and
                word.lower() not in manual_exclusions and
                len(word) > 2 and  # Exclude very short words
                not any(c.isdigit() for c in word) and  # Exclude words with numbers
                re.match(r'^[a-z]+$', word.lower()))  # Only pure alphabetic words
                
    # Check if seed words were provided in the input
    if hasattr(input_data, 'seed_words') and input_data.seed_words and input_data.seed_words.strip().lower() != "none":
        seed_input = input_data.seed_words.strip()

        if ":" in seed_input:
            use_group_label = True
            for part in seed_input.split(";"):
                part = part.strip()
                if ":" in part:
                    group_label, word_str = part.split(":", 1)
                    group_label = group_label.strip().lower()
                    words = [w.strip().lower() for w in word_str.split(",") if w.strip()]
                    seed_groups[group_label] = set(words)
                    seed_words.append(group_label)
                    print(f"Group mode: all {words} will be treated as '{group_label}'")
                else:
                    individuals = [w.strip().lower() for w in part.split(",") if w.strip()]
                    seed_words.extend(individuals)
                    print(f"Individual mode: adding {individuals}")
        else:
            seed_words = [w.strip().lower() for w in seed_input.split(",") if w.strip()]
            print(f"Pure individual word mode: using {seed_words}")
        if use_group_label:
            sentences = [replace_group_words(text, seed_groups) for text in sentences]
    else:
        # Process sentences to get word frequencies for auto-selection of top words
        print("WARNING: No seed words provided or 'NONE' specified. Using top frequent words as seeds...")
        auto_selected_seeds = True  # The plain network below then receives an empty seed list
        excluded_words = stop_words.union(set(WORD_FAMILIES.keys()))
        all_tokens = []
        for sentence in sentences:
            tokens = tokenize_and_filter([sentence], stop_list=stop_words,
                                        lemmatize=True,
                                        cross_pos_normalize=input_data.cross_pos_normalize)
            filtered_tokens = [token.lower() for token in tokens if custom_word_filter(token)]
            all_tokens.extend(filtered_tokens)
        word_counts = Counter(all_tokens)
        top_words = [word for word, _ in word_counts.most_common(30)
                    if word.lower() not in excluded_words][:min(10, len(word_counts))]
        seed_words = top_words
        print(f"Top frequent words as seeds: {seed_words}")

    start = time.time()

    # Clean and normalize seed words; they are already preprocessed with word families
    print(f"Original seed words before cleaning: {seed_words}")
    clean_seeds = clean_words(seed_words)
    print(f"Seed words after cleaning: {clean_seeds}")
    seed_words = normalize_words(clean_seeds, stop_words, lemmatize=True, cross_pos_normalize=input_data.cross_pos_normalize)
    seed_words = list(set(seed_words))
    print("Final normalized seed words:", seed_words)
    
    
    # choose context source: keep stop-words only for RoBERTa
    if input_data.clustering_method == 1:          # 1 = RoBERTa
        sentences_for_embedding = sentences        # full context
    else:                                          # 2-4 = Jaccard/PMI/TF-IDF
        sentences_for_embedding = filtered_sentences

    word_embeddings, similarity_matrix, co_occurrence_matrix = train_embedding(
        sentences_for_embedding,
        context_window = input_data.window_size, 
        stop_list  = stop_words, 
        seed_words = seed_words, 
        clustering_method  = input_data.clustering_method,
        num_words = input_data.num_words, 
        lemmatize = True, 
        min_word_frequency = input_data.min_word_frequency,
        reuse_clusterings  = input_data.reuse_clusterings,
        cross_pos_normalize= getattr(input_data, 'cross_pos_normalize', False),
        distance_metric = getattr(input_data, 'distance_metric', 'default'),
        custom_word_filter = custom_word_filter
    )

    elapsed_time = time.time() - start
    print(f"Embedding generation completed in {elapsed_time:.1f} seconds")

    if word_embeddings is None:
        print("Error: Failed to generate embeddings. Please check your input data and parameters.")
        return
    # Plot Similarity Heatmap
    
    print("Plotting similarity heatmap...")
    fig_heat = plot_heatmap(
        input_data.clustering_method, word_embeddings, 
        similarity_matrix, co_occurrence_matrix, input_data.distance_metric
    )


    print("-----------------------------------------------")
    print("Plotting semantic network (plain, no categories)...")
    fig_sn1 = plot_semantic_network(
        word_embeddings,
        [] if auto_selected_seeds else seed_words,
        input_data.clustering_method,
        similarity_matrix, co_occurrence_matrix,
        semantic_categories=None,               
        link_threshold = input_data.link_threshold,
        link_color_threshold= input_data.link_color_threshold,
        distance_metric=getattr(input_data, 'distance_metric', 'default'))
    
    filename = f"semantic_network_plain_m{input_data.clustering_method}_{input_data.distance_metric}.png"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    out_path = os.path.join(OUTPUT_DIR, filename)

    fig_sn1.suptitle("Semantic Network (Plain)", fontsize=30, y=0.98, fontweight='bold')
    fig_sn1.subplots_adjust(top=0.95)

    fig_sn1.savefig(out_path, dpi=300, bbox_inches="tight")
    print(f"✔ [OK] Saved {out_path}")

    plt.show()
    

    
    # Coloured visualisation (only if custom colors are enabled) 
    if getattr(input_data, 'custom_colors', False):
        # Print the actual list of words used in the network so users can match for coloring
        print("\n Word list available for coloring:")
        print(sorted(list(word_embeddings.keys())))
        print("\nUsing predefined semantic categories for custom grouping")
        # Use seed_words instead of empty list for the colored visualization
        fig_sn2 = plot_semantic_network(
            word_embeddings,
            seed_words,
            input_data.clustering_method,
            similarity_matrix, co_occurrence_matrix,
            semantic_categories = input_data.semantic_categories,
            link_threshold     = input_data.link_threshold,
            link_color_threshold = input_data.link_color_threshold,
            distance_metric=getattr(input_data, 'distance_metric', 'default')
        )
        filename = f"semantic_network_customcolor_m{input_data.clustering_method}_{input_data.distance_metric}.png"
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        out_path = os.path.join(OUTPUT_DIR, filename)

        fig_sn2.suptitle("Semantic Network (Custom Colors)", fontsize=30, y=0.98, fontweight='bold')
        fig_sn2.subplots_adjust(top=0.95)

        fig_sn2.savefig(out_path, dpi=300, bbox_inches="tight")
        print(f"✔ [OK] Saved {out_path}")

        plt.show()
    
    

    # Add a second network visualization that removes seed nodes
    print("\nGenerating secondary network with seed nodes hidden...")
    
    # Filter out seed words from word embeddings and matrices
    non_seed_words = [word for word in word_embeddings.keys() if word not in seed_words]
    
    if len(non_seed_words) > 5:  # Only proceed if we have enough nodes to make a meaningful network
        non_seed_embeddings = {word: word_embeddings[word] for word in non_seed_words}
        
        # Create filtered similarity/co-occurrence matrices
        if similarity_matrix is not None:
            words = list(word_embeddings.keys())
            indices = [words.index(word) for word in non_seed_words]
            filtered_similarity = similarity_matrix[np.ix_(indices, indices)]
        else:
            filtered_similarity = None
            
        if co_occurrence_matrix is not None:
            words = list(word_embeddings.keys())
            indices = [words.index(word) for word in non_seed_words]
            filtered_co_occurrence = co_occurrence_matrix[np.ix_(indices, indices)]
        else:
            filtered_co_occurrence = None
        
        # Plot the filtered network
        print(f"Plotting secondary network with {len(non_seed_words)} nodes (seeds hidden)...")
        plt.figure(figsize=(16, 12))
        fig_sn3 = plot_semantic_network(
            non_seed_embeddings, [], 
            input_data.clustering_method, 
            filtered_similarity, filtered_co_occurrence, 
            semantic_categories=None,
            link_threshold=input_data.link_threshold,
            link_color_threshold=input_data.link_color_threshold,
            distance_metric=getattr(input_data, 'distance_metric', 'default')
        )
        filename = f"semantic_network_noseeds_m{input_data.clustering_method}_{input_data.distance_metric}.png"
        out_path = os.path.join(OUTPUT_DIR, filename)

        # Set the title before saving so it appears in the exported image
        fig_sn3.suptitle("Semantic Network (Seeds Hidden)", fontsize=30, y=0.98, fontweight='bold')
        fig_sn3.subplots_adjust(top=0.95)

        fig_sn3.savefig(out_path, dpi=300, bbox_inches="tight")
        print(f"✔ [OK] Saved {out_path}")

        plt.show()
        
        # If custom coloring was used in the first visualization, also do it for the second
        if hasattr(input_data, 'custom_colors') and input_data.custom_colors and hasattr(input_data, 'semantic_categories'):
            semantic_categories = input_data.semantic_categories
            print("\nGenerating secondary network with custom grouping (seeds hidden)...")
            fig_sn4 = plot_semantic_network(
                non_seed_embeddings, [], 
                input_data.clustering_method, 
                filtered_similarity, filtered_co_occurrence, 
                semantic_categories=semantic_categories,
                link_threshold=input_data.link_threshold,
                link_color_threshold=input_data.link_color_threshold,
                distance_metric=getattr(input_data, 'distance_metric', 'default')
            )
            filename = f"semantic_network_noseeds_customcolor_m{input_data.clustering_method}_{input_data.distance_metric}.png"
            out_path = os.path.join(OUTPUT_DIR, filename)

            # Set the title before saving so it appears in the exported image
            fig_sn4.suptitle("Semantic Network with Custom Grouping (Seeds Hidden)", fontsize=30, y=0.98, fontweight='bold')
            fig_sn4.subplots_adjust(top=0.95)

            fig_sn4.savefig(out_path, dpi=300, bbox_inches="tight")
            print(f"✔ [OK] Saved {out_path}")

            plt.show()
    else:
        print("Not enough non-seed nodes to generate a meaningful secondary network.")
    
    # Print the list of actual nodes used in the network (excluding seeds)
    print("\nFinal nodes used in network (excluding seeds):")
    if 'non_seed_words' in locals() and len(non_seed_words) > 0:
        # Get frequencies for each node and sort by frequency (highest first)
        print(f"Total non-seed words: {len(non_seed_words)}")
        
        # Use word_frequencies if it exists, otherwise fall back to word_counts from seed auto-selection
        freq_lookup = locals().get('word_frequencies', locals().get('word_counts'))
        if freq_lookup is None:
            print("Warning: Word frequency information not available")
            # Just print the words without frequencies
            for word in sorted(non_seed_words):
                print(f"- {word}")
        else:
            # Create list of (word, frequency) pairs and sort by frequency
            freq_sorted_words = [(word, freq_lookup[word]) for word in non_seed_words if word in freq_lookup]
            freq_sorted_words.sort(key=lambda x: x[1], reverse=True)
            
            for word, freq in freq_sorted_words:
                print(f"- {word}: {freq}")
    else:
        print("No non-seed nodes were used in the network.")
    
    # Print the number of filtered sentences used
    if 'filtered_sentences' in locals():
        print(f"\nTotal number of filtered sentences used: {len(filtered_sentences)}")
        if 'seed_words' in locals() and seed_words:
            # Enhanced seed word detection using word families
            seed_containing_sentences = 0
            for sentence in filtered_sentences:
                sentence_lower = sentence.lower()
                contains_seed = False
                for seed in seed_words:
                    # Check direct match
                    if seed.lower() in sentence_lower:
                        contains_seed = True
                        break
                    # Check word family variants
                    for family, variants in WORD_FAMILIES.items():
                        if seed.lower() == family.lower() or seed.lower() in [v.lower() for v in variants]:
                            if any(variant.lower() in sentence_lower for variant in variants):
                                contains_seed = True
                                break
                    if contains_seed:
                        break
                if contains_seed:
                    seed_containing_sentences += 1
            print(f"Number of filtered sentences containing seed words: {seed_containing_sentences}")
    
    print("\nAnalysis complete")
    print(f"\nNetwork visualization method: {input_data.clustering_method}")
    
    if input_data.clustering_method == 1:
        print("Method: RoBERTa – Shows semantic relationships based on contextual embeddings")
    elif input_data.clustering_method == 2:
        if input_data.distance_metric == "cosine":
            print("Method: Jaccard (cosine) – Uses context window vectors to compute cosine-based similarity between word usage patterns")
        elif input_data.distance_metric == "default":
            print("Method: Jaccard (default) – Uses binary co-occurrence counts within a context window to capture word overlap")
    elif input_data.clustering_method == 3:
        print("Method: PMI – Highlights statistically significant word associations based on pointwise mutual information")
    elif input_data.clustering_method == 4:
        if input_data.distance_metric == "cosine":
            print("Method: TF-IDF (cosine) – Uses TF-IDF-weighted context vectors to compute cosine similarity between words")
        elif input_data.distance_metric == "default":
            print("Method: TF-IDF (default) – Uses raw TF-IDF-weighted co-occurrence scores for word associations")


### 6. Code-based Heatmap 

This function builds and displays a heatmap of how often the top N interview codes appear together across transcripts. When `clustered` is enabled, hierarchical clustering reorders the axes so that related codes sit next to each other, giving a quick view of thematic overlap in the dataset.
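
As a rough illustration of what "appear together" means here, the sketch below counts, for each pair of selected codes, the number of entries that contain both, then arranges the counts into a symmetric matrix. The toy rows and the helper names `count_cooccurrences` and `to_matrix` are assumptions made for this example; the notebook's own function uses a NumPy array and nested loops instead.

```python
from collections import Counter
from itertools import combinations

# Toy data (made up): each set holds the codes attached to one text segment
rows = [
    {"trust", "risk", "cost"},
    {"trust", "risk"},
    {"cost"},
    {"risk", "cost"},
]


def count_cooccurrences(rows, selected_codes):
    """Count how many rows contain each unordered pair of selected codes."""
    selected = set(selected_codes)
    pair_counts = Counter()
    for codes in rows:
        # Sort so each pair is always stored in the same order
        present = sorted(codes & selected)
        pair_counts.update(combinations(present, 2))
    return pair_counts


def to_matrix(pair_counts, selected_codes):
    """Arrange pair counts into a symmetric matrix with a zero diagonal."""
    index = {code: i for i, code in enumerate(selected_codes)}
    n = len(selected_codes)
    matrix = [[0] * n for _ in range(n)]
    for (a, b), count in pair_counts.items():
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = count
    return matrix


selected = ["trust", "risk", "cost"]
pair_counts = count_cooccurrences(rows, selected)
matrix = to_matrix(pair_counts, selected)
print(matrix)  # [[0, 2, 1], [2, 0, 2], [1, 2, 0]]
```

The diagonal stays at zero because a code is not counted as co-occurring with itself, which matches the matrix built in the function below.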

In [None]:
class HeatmapInput(BaseModel):
    filepath: FilePath 
    num_codes: int = 10
    seed_codes: Optional[List[str]] = None  # Optional seed codes; when set, their top co-occurring codes are added
    projects: Optional[List[str]] = None 
    data_groups: Optional[List[str]] = None
    clustered: bool = True

    @field_validator("num_codes")
    @classmethod
    def validate_num_codes(cls, v):
        if v <= 0:
            raise ValueError("num_codes must be greater than 0")
        return v


def parse_string_list(value: Union[str, list, None]) -> List[str]:
    """
    Parse a code list stored as a string, a list, or a missing value.

    String inputs go through a cached helper. Lists are handled directly
    because they are unhashable and cannot be used as lru_cache keys.
    
    Args:
        value: String, list or None containing codes
        
    Returns:
        List[str]: Cleaned and parsed list of codes
    """
    if value is None:
        return []

    if isinstance(value, list):
        return [str(item).lower().strip() for item in value if item]

    if isinstance(value, str):
        # Return a fresh list so callers cannot mutate the cached result
        return list(_parse_code_string(value))

    # NaN and other non-string scalars are treated as empty
    return []


@lru_cache(maxsize=10000)
def _parse_code_string(value: str) -> tuple:
    """Cached parser for string-formatted code lists; returns a tuple of codes."""
    value = value.strip()

    if value in ["", "[]", "['']", '[""]', "nan", "NaN"]:
        return ()

    try:
        if value.startswith("[") and value.endswith("]"):
            parsed = ast.literal_eval(value)
            if isinstance(parsed, list):
                return tuple(str(item).lower().strip() for item in parsed if item and str(item).strip())
    except (ValueError, SyntaxError):
        pass

    # Fallback: strip brackets and quotes, then split on commas
    cleaned = value.strip("[]").replace("'", "").replace('"', "")
    items = [item.strip().lower() for item in cleaned.split(",")]
    return tuple(item for item in items if item)


def create_code_cooccurrence_heatmap(input_data: HeatmapInput):
    filepath = input_data.filepath
    num_codes = input_data.num_codes
    projects = input_data.projects
    data_groups = input_data.data_groups
    clustered = input_data.clustered

    df = pd.read_csv(filepath)

    # Filter by projects 
    if projects:
        df = df[df['project'].isin(projects)]

    # Filter by data groups
    if data_groups:
        df = df[df['data_group'].apply(lambda x: any(g in parse_string_list(x) for g in data_groups))]

    # Collect every code from every row
    all_codes = []
    for codes in df['codes'].dropna():
        all_codes.extend(parse_string_list(codes))

    # Select codes: seeds plus their top-N co-occurring codes, or the top-N most frequent codes
    if input_data.seed_codes:
        seed_set = set(code.lower().strip() for code in input_data.seed_codes)
        cooccurrence_counter = Counter()

        # Count co-occurring codes with seeds
        for codes in df['codes'].dropna():
            code_list = set(parse_string_list(codes))
            if seed_set & code_list:  # If any seed is in the list
                overlapping = code_list - seed_set
                cooccurrence_counter.update(overlapping)

        # Add top-N co-occurring codes
        top_overlap = [code for code, _ in cooccurrence_counter.most_common(num_codes)]
        selected_codes = list(seed_set) + top_overlap
        print(f"Selected codes: {selected_codes}")
    else:
        # Use top-N most frequent codes in corpus
        selected_codes = [code for code, _ in Counter(all_codes).most_common(num_codes)]
        print(f"Top {num_codes} most frequent codes: {selected_codes}")

    # Check if we have any codes to analyze
    if not selected_codes:
        print("No codes found to analyze. Please check your input data.")
        return

    # Build the symmetric co-occurrence matrix by counting rows that contain both codes of a pair
    cooc_matrix = np.zeros((len(selected_codes), len(selected_codes)))
    
    for codes in df['codes'].dropna():
        codes_set = set(parse_string_list(codes)).intersection(selected_codes)
        for i, code1 in enumerate(selected_codes):
            for j in range(i + 1, len(selected_codes)):
                code2 = selected_codes[j]
                if code1 in codes_set and code2 in codes_set:
                    cooc_matrix[i][j] += 1
                    cooc_matrix[j][i] += 1

    # Check if matrix is empty
    if np.all(cooc_matrix == 0):
        print("No co-occurrences found. Please check your input data.")
        return

    heatmap_df = pd.DataFrame(cooc_matrix, index=selected_codes, columns=selected_codes)
    plt.style.use('dark_background')

    
    if clustered:
        # Convert co-occurrence matrix to proper distance format for linkage
        row_linkage = hierarchy.linkage(pdist(heatmap_df), method='ward')
        col_linkage = hierarchy.linkage(pdist(heatmap_df.T), method='ward')
        
        g = sns.clustermap(heatmap_df,
                        annot=True,
                        fmt='g',
                        cmap='inferno',
                        row_linkage=row_linkage,
                        col_linkage=col_linkage,
                        figsize=(12, 10),
                        dendrogram_ratio=0.2,
                        colors_ratio=0.03)
        g.fig.patch.set_facecolor('black')
        g.ax_heatmap.set_facecolor('black')

        for item in [g.ax_row_dendrogram, g.ax_col_dendrogram]:
            item.set_facecolor('black')
            for c in item.collections:
                c.set_color('white')

        filename = f"code_heatmap_clustered_{num_codes}.png"
        out_path = os.path.join(OUTPUT_DIR, filename)
        g.savefig(out_path, dpi=300, bbox_inches="tight")
        print(f"✔ [OK] Saved clustered heatmap: {out_path}")
    else:
        plt.figure(figsize=(12, 10))
        ax = sns.heatmap(
            heatmap_df,
            annot=True,
            fmt='g',
            cmap='inferno'
        )
        ax.set_title('Code Co-occurrence Matrix')

        filename = f"code_heatmap_plain_{num_codes}.png"
        out_path = os.path.join(OUTPUT_DIR, filename)
        plt.savefig(out_path, dpi=300, bbox_inches="tight")
        print(f"✔ [OK] Saved plain heatmap: {out_path}")

    plt.tight_layout()
    plt.show()
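
The pairwise counting loop in the function above can also be written as a single matrix product. The sketch below is an illustrative alternative, not the toolkit's implementation: `cooccurrence_matrix` and the sample entries are hypothetical. It builds a binary entry-by-code indicator matrix, multiplies its transpose by itself, and zeroes the diagonal so that, as in the loop version, a code is never counted as co-occurring with itself.

```python
import numpy as np

def cooccurrence_matrix(entries, selected_codes):
    """Count how often each pair of codes appears in the same entry.

    entries: list of code lists, one per dataset row (hypothetical input).
    selected_codes: ordered list of codes that defines the matrix axes.
    """
    index = {code: i for i, code in enumerate(selected_codes)}
    # indicator[r, c] is 1 when entry r contains code c
    indicator = np.zeros((len(entries), len(selected_codes)), dtype=int)
    for row, codes in enumerate(entries):
        for code in set(codes):
            if code in index:
                indicator[row, index[code]] = 1
    cooc = indicator.T @ indicator   # pair counts in one matrix product
    np.fill_diagonal(cooc, 0)        # ignore self co-occurrence
    return cooc

entries = [["a", "b"], ["a", "b", "c"], ["c"], ["b", "c"]]
matrix = cooccurrence_matrix(entries, ["a", "b", "c"])
print(matrix)
```

For a handful of codes the loop version is fast enough; the matrix form mainly helps when the number of selected codes grows.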


# Data Overview 

In [None]:
# Data Overview and Preparation 

# Path to your cleaned data
DATA_PATH = CSV_PATH  # alias for clarity

# --- Schema and Required Fields ---
SCHEMA = {
    "project": str,            # Project name
    "number": str,             # Position information
    "reference": int,          # Position information
    "text": str,               # Content; critical field, must not be empty
    "document": str,           # Data source; critical field, must not be empty
    "old_codes": list[str],    # Optional: codings, must be a list of strings
    "start_position": int,     # Position information
    "end_position": int,       # Position information
    "data_group": list[str],   # Optional: differentiates document sets, must be a list of strings
    "text_length": int,        # Optional: NLP info
    "word_count": int,         # Optional: NLP info
    "doc_id": str,             # Optional: NLP info, unique paragraph-level identifier
    "codes": list[str]         # Critical for analyses with codes, must be a list of strings
}

REQUIRED_FIELDS = ["text", "document", "project"]

# --- Safe List Conversion Function ---
def safe_convert_list(x):
    """Safe conversion of various formats to a list, handling edge cases."""
    try:
        if isinstance(x, float) and np.isnan(x):
            return []
        if x in (None, "", "nan", "NaN"):
            return []
        if isinstance(x, list):
            return [str(i).strip().strip("\"'[]") for i in x if str(i).strip()]

        x = str(x).strip()
        if x == "[]":
            return []

        x = x.replace("’", '"').replace("][", ",").replace("],['", '","').replace('""', '"')

        if not x.startswith("["):
            x = "[" + x
        if not x.endswith("]"):
            x = x + "]"

        try:
            items = json.loads(x)
        except Exception:
            items = x.strip("[]").split(",")
        return [str(i).strip().strip("\"'[]") for i in items if str(i).strip()]

    except Exception:
        print(f"Warning: Could not convert value: {x}")
        return []


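def parse_list_compat(val):
    """Lenient list parser for lists, JSON-style strings, and comma-separated strings."""
    # Check for lists first: pd.isna() on a list returns an array,
    # which raises "truth value is ambiguous" inside an `if`.
    if isinstance(val, list):
        return val
    if pd.isna(val) or val in ("", "nan", "NaN"):
        return []
    if isinstance(val, str):
        try:
            if val.startswith("[") and val.endswith("]"):
                return json.loads(val)
        except Exception:
            pass
        # Fallback: strip brackets and quotes, then split on commas
        return [v.strip() for v in val.strip("[]").replace("'", "").replace('"', "").split(",") if v.strip()]
    return [val]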

# --- Load Data ---
df_clean = pd.read_csv(DATA_PATH)

# --- Required Fields ---
missing_required = [c for c in REQUIRED_FIELDS if c not in df_clean.columns]
if missing_required:
    print(f"Missing required fields: {missing_required}")

for col in REQUIRED_FIELDS:
    if col in df_clean.columns:
        n_empty = df_clean[col].isna().sum() + (df_clean[col] == "").sum()
        print(f"Empty '{col}': {n_empty}")


# --- Show First 5 Records ---
print("\nFirst 5 records:")
pprint.pprint(df_clean.head(5).to_dict(orient="records"))

# --- Schema Match ---
print("\nSchema match by column:")
for col, expected in SCHEMA.items():
    if col not in df_clean.columns:
        print(f"{col}: MISSING")
    else:
        dtype = (df_clean[col].dropna().map(type).mode()[0] if not df_clean[col].dropna().empty else None)
        print(f"{col}: expected {expected.__name__}, found {dtype.__name__ if dtype else 'None'}")

# --- Apply List Conversion to Specific Columns ---
list_columns = ['old_codes', 'data_group', 'codes']
for col in list_columns:
    if col in df_clean.columns:
        print(f"\nConverting and sanitizing {col}...")
        df_clean[col] = df_clean[col].apply(safe_convert_list)

# --- Verify List Conversion ---
print("\nVerification of list columns:")
for col in list_columns:
    if col in df_clean.columns:
        invalid = df_clean[~df_clean[col].apply(lambda x: isinstance(x, list))].shape[0]
        print(f"{col}: {invalid} invalid entries")
        if invalid > 0:
            print("Sample of first invalid entry:")
            print(df_clean[~df_clean[col].apply(lambda x: isinstance(x, list))][col].iloc[0])
        else:
            print(f"Sample of cleaned {col}:")
            sample = df_clean[col].iloc[0] if len(df_clean) > 0 else []
            print(f"First entry: {sample}")

# --- Cleaning Summary ---
print("\nCleaning Summary:")
for col in list_columns:
    if col in df_clean.columns:
        empty = df_clean[col].apply(lambda x: len(x) == 0).sum()
        print(f"{col} → empty: {empty}/{len(df_clean)}")
        
# --- Add word_count column if needed ---
if "word_count" not in df_clean.columns and "text" in df_clean.columns:
    df_clean["word_count"] = df_clean["text"].str.split().str.len().fillna(0)

print("\nDataset overview:")
df_clean.info()  # info() prints its own report and returns None, so it is not wrapped in print()

# --- Codes column: parse and show top codes ---
if 'codes' in df_clean.columns:
    codes_flat = df_clean['codes'].dropna().apply(safe_convert_list).explode()
    code_counts = codes_flat.value_counts().head(20)
    print("\nTop 20 codes by frequency:")
    print(code_counts)
else:
    print("No 'codes' column found in the dataframe")

# --- Count documents by project ---
if 'project' in df_clean.columns:
    if 'document' in df_clean.columns:
        project_doc_counts = df_clean.groupby('project')['document'].nunique()
    else:
        project_doc_counts = df_clean['project'].value_counts()
    print("\nNumber of documents by project:")
    print(project_doc_counts)

# --- Top values for data_group ---
if 'data_group' in df_clean.columns:
    dg_flat = df_clean['data_group'].dropna().apply(safe_convert_list).explode()
    dg_counts = dg_flat.value_counts().head(10)
    print("\nTop 10 data_group values by frequency:")
    print(dg_counts)
else:
    print("\nNo 'data_group' column found in the dataframe")

# --- Word Count ---
# 'word_count' was added above whenever a 'text' column exists, so one check is enough.
if 'word_count' in df_clean.columns:
    avg_word_count = df_clean['word_count'].mean()
    print(f"\nAverage word count in text: {avg_word_count:.2f}")
else:
    print("No 'text' or 'word_count' column found in the dataframe")

# --- Descriptive Statistics ---
print("\nMissing Values Count:")
print(df_clean.isnull().sum())

numeric_cols = ['text_length', 'word_count']
print("\nDescriptive Statistics for Numeric Columns:")
for col in numeric_cols:
    if col in df_clean.columns:
        print(f"\nStats for {col}:")
        print(f"Count: {df_clean[col].count()}")
        print(f"Median: {df_clean[col].median():.2f}")
        print(f"Standard Deviation: {df_clean[col].std():.2f}")

print("\nDataset Overview:")
df_clean.info()  # info() prints its own report and returns None

# --- Unique Projects ---
if 'project' in df_clean.columns:
    projects = df_clean['project'].unique()
    print("\nUnique Projects:")
    for project in projects:
        print(project)

print(df_clean.head(5))


# --- Save Cleaned Data, Then Reload and Verify ---
output_file_dropbox = os.path.join(DATA_DIR, '1_cleaned_data.csv')
output_file_combined = os.path.join(BACKUP_DIR, '1_cleaned_data.csv')
df_clean.to_csv(output_file_dropbox, index=False)
df_clean.to_csv(output_file_combined, index=False)
# Report the paths only after both files have been written
print(f"\nCleaned data saved to: {output_file_dropbox}")
print(f"Cleaned data also saved to: {output_file_combined}")
df = pd.read_csv(output_file_combined)
if 'project' in df.columns:
    projects_loaded = df['project'].unique()
    print("\nUnique Projects (from loaded file):")
    for project in projects_loaded:
        print(project)

print(df.head(5))
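
One thing to keep in mind when reloading the saved file: CSV has no list type, so list columns such as `codes` come back as strings like `"['background', 'family']"` and need to be parsed again. The sketch below illustrates this with the standard library only. `parse_list_cell` is a hypothetical helper written for this example; the toolkit's own parsers (such as `safe_convert_list`) may handle more edge cases.

```python
import ast
import csv
import io

def parse_list_cell(cell):
    """Turn a CSV cell such as "['a', 'b']" back into a list of strings."""
    if cell is None or cell.strip() in ("", "[]", "nan", "NaN"):
        return []
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Not a Python literal: fall back to a plain comma split
        return [part.strip() for part in cell.strip("[]").split(",") if part.strip()]
    if isinstance(value, (list, tuple, set)):
        return [str(item) for item in value]
    return [str(value)]

# Write a row with a list value, then read it back
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["text", "codes"])
writer.writerow(["first entry", ["background", "family"]])
buffer.seek(0)

rows = list(csv.DictReader(buffer))
raw = rows[0]["codes"]            # a string, not a list
restored = parse_list_cell(raw)   # a list again
print(type(raw).__name__, restored)
```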

# Basic Analytics Tool 

## 1. Wordcloud

This section contains the word cloud execution block. It produces a word cloud from the loaded data, in which more frequent words appear larger. Words can be color-coded by user-defined categories that represent concepts or themes.
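
As a rough illustration of the two ideas in this description (size from frequency, color from category), here is a minimal sketch. It does not reproduce `generate_wordcloud`: `word_frequencies` and `color_for_word` are hypothetical helpers, the tokenizer is a simple regular expression, and the colors are placeholder hex strings.

```python
import re
from collections import Counter

def word_frequencies(texts, stopwords=frozenset()):
    """Count lowercase word tokens across all texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts

def color_for_word(word, categories, default="#888888"):
    """Return the color of the first category that lists the word."""
    for spec in categories.values():
        if word in spec["words"]:
            return spec["color"]
    return default

categories = {
    "People & Relations": {"words": {"teacher", "student"}, "color": "#1f77b4"},
    "Education & Career": {"words": {"school", "course"}, "color": "#2ca02c"},
}
texts = [
    "The teacher met the student at school.",
    "School was hard for the student.",
]
freqs = word_frequencies(texts, stopwords={"the", "at", "was", "for"})
colors = {word: color_for_word(word, categories) for word in freqs}
print(freqs.most_common(3), colors)
```

Words that belong to no category fall back to the default color, so uncategorized terms stay visible without competing with the themed ones.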

In [None]:
# Filter text based on project and terms
filter_terms = ['education', 'learning', 'teaching', 'student',
                'school', 'classroom', 'curriculum', 'academic']  # Adjust as needed

filtered_df = df.copy()
if 'project' in filtered_df.columns:
    filtered_df = filtered_df[filtered_df['project'] == 'oral_history']  # Adjust as needed

if 'text' in filtered_df.columns:
    text_series = filtered_df['text'].fillna('').astype(str)
    # Keep entries containing at least one filter term (case-insensitive substring match)
    mask = text_series.str.lower().apply(lambda x: any(term in x for term in filter_terms))
    text_series = text_series[mask]

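# Categories
# plt.get_cmap replaces cm.get_cmap, which was removed in Matplotlib 3.9;
# the "mako" colormap is registered when seaborn is imported.
cmap = plt.get_cmap("mako", 5)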
categories = {
    "People & Relations": {
        "words": {
            "people", "student", "students", "teacher", "teachers", "professor", "professors",
            "advisor", "mentor", "classmates", "friends", "colleague", "colleagues",
            "mother", "mom", "father", "dad", "family", "kids", "children",
            "wife", "husband", "brother", "sister"
        },
        "color": cmap(0)
    },
    "Education & Career": {
        "words": {
            "school", "college", "university", "department", "campus",
            "course", "courses", "class", "classes", "curriculum", "degree",
            "major", "minor", "graduate", "graduated", "thesis", "exam",
            "research", "lab", "laboratory", "project", "projects",
            "engineering", "engineer", "physics", "mathematics",
            "industry", "company", "career", "job", "internship"
        },
        "color": cmap(1)
    },
    "Emotions & Cognition": {
        "words": {
            "think", "thought", "know", "understand", "learn", "learning",
            "decide", "decision", "believe", "remember", "idea", "ideas",
            "feel", "feeling", "feelings", "excited", "interested",
            "curious", "happy", "proud", "worried", "scared", "confused"
        },
        "color": cmap(2)
    },
    "Time / Duration": {
        "words": {
            "first", "second", "later", "since", "before", "after",
            "started", "start", "begin", "early", "late",
            "day", "days", "week", "weeks", "month", "months",
            "year", "years", "semester", "summer", "winter", "spring", "fall"
        },
        "color": cmap(3)
    },
    "Daily Activities & Work": {
        "words": {
            "work", "working", "teach", "teaching", "study", "studying",
            "read", "reading", "write", "writing", "talk", "talking",
            "meet", "meeting", "present", "presentation", "travel", "move",
            "build", "design", "make", "made", "fix", "support", "help"
        },
        "color": cmap(4)
    },
}
# Word cloud
generate_wordcloud(
    text_series=text_series,
    stopwords_path=STOP_LIST_FILE,
    title='Word Cloud',
    out_dir=OUTPUT_DIR,  # change if needed
    categories=categories
)

# Intermediate Analytics Tools 

This section contains all heatmap and network execution blocks.

### User Configuration for Intermediate Visualization Tool

Overall Embedding/Clustering

In [None]:
#---------------------------------------------------------------------------------
# Variables Preparation
#---------------------------------------------------------------------------------
"""
OVERVIEW:
----------------------
This script generates semantic networks based on the specified clustering method,
visualizing relationships between concepts in the text corpus. Heatmaps and t-SNE 
plots are also generated to help visualize the relationships.

USAGE INSTRUCTIONS:
----------------------
1. Configure the parameters in the CONFIG section below.
2. Save this notebook and run each block in order.
3. Alternatively, export the notebook to a Python script and
   execute it with: python semantic_network.py
4. The CONFIG values are the ones a user would tweak, or be prompted
   to enter in the interactive version.
"""

if __name__ == "__main__":

    # ========================= CONFIG =========================
    # Paths & Stop-list
    csv_path               = CSV_PATH # Your dataset path
    stop_list_path         = STOP_LIST_FILE
    use_custom_stoplist    = True

    # Core Analysis
    clustering_method = 3           # 1 = RoBERTa, 2 = Jaccard, 3 = PMI, 4 = TF-IDF
    distance_metric   = "cosine"   
    
# Note on clustering method and distance metric:
#
# 1. RoBERTa (clustering_method = 1)
#    - distance_metric is always "default" (ignored internally)
#    - Uses contextual embeddings directly
#
# 2. Jaccard (clustering_method = 2)
#    - "default": context-window overlap measured by Jaccard index
#                 (set-based, binary overlap score)
#    - "cosine" : context vectors built from co-occurrence counts,
#                 compared using cosine similarity (frequency-sensitive)
#
# 3. PMI (clustering_method = 3)
#    - "default": classic PMI co-occurrence scores
#    - "cosine" : PMI-weighted context vectors compared with cosine similarity
#                 (captures similarity of PMI distributions rather than raw scores)
#
# 4. TF-IDF (clustering_method = 4)
#    - "cosine" : standard TF-IDF vectors compared with cosine similarity
#                 (recommended; normalized similarity).
#    - "default": experimental overlap-based method; shared context tokens are
#                 weighted by their global TF-IDF scores (unnormalized)
#                 → Use with caution; included as an optional example

    window_size            = 20 # Context window size for co-occurrence
    num_words              = 25 # Max number of top frequent words to analyze
    min_word_frequency     = 2 # Ignore words that appear fewer times
    reuse_clusterings      = False # Whether to reuse saved clustering results if available


    # Preprocessing Filters
    cross_pos_normalize    = True # Normalize words across parts of speech (e.g., "learn", "learning", "learned" -> "learn")
    projects               = ["oral_history"]     # Filter by project names 
    data_groups            = ["interview"]   # Filter by data_groups 
    codes                  = ["background"]        # Analyse specific codes
    excluded_codes         = ['interviewer']  # Exclude these codes, removing 'interviewer' is important for NLP

    # Visualisation
    title                = "Semantic Network (all interviews, contextual embeddings)"
    link_threshold       = 0.50          
    link_color_threshold = 0.75   # set to 99 to remove black links       
    custom_colors = True                 

    # Seeds & Colours
    seed_words = "education: learning, teaching, student, school, classroom, curriculum, academic"

    # You can provide seed words in two formats:
    # 1. Grouped format (for custom node colors and labels):
    #       "GroupName1: word1, word2; GroupName2: word3, word4; ..."
    #    Example: 
    #       "Therapy: therapy, physical therapy; Caregivers: caregivers, caregiver; family"
    #    → 'family' will be treated as its own group, using its own color.
    #
    # 2. Flat list format (no grouping, default color):
    #       "dementia, therapy, caregivers"
    #    → All words will appear as individual highlighted nodes with default styling.
    
    # consider inductive readings, and analysis 
    cmap = plt.get_cmap("mako", 5)  # cm.get_cmap was removed in Matplotlib 3.9
    semantic_categories = {
    "People & Relations": {
        "words": {
            "people", "student", "students", "teacher", "teachers", "professor", "professors",
            "advisor", "mentor", "classmates", "friends", "colleague", "colleagues",
            "mother", "mom", "father", "dad", "family", "kids", "children",
            "wife", "husband", "brother", "sister"
        },
        "color": cmap(0)
    },
    "Education & Career": {
        "words": {
            "school", "college", "university", "department", "campus",
            "course", "courses", "class", "classes", "curriculum", "degree",
            "major", "minor", "graduate", "graduated", "thesis", "exam",
            "research", "lab", "laboratory", "project", "projects",
            "engineering", "engineer", "physics", "mathematics",
            "industry", "company", "career", "job", "internship"
        },
        "color": cmap(1)
    },
    "Emotions & Cognition": {
        "words": {
            "think", "thought", "know", "understand", "learn", "learning",
            "decide", "decision", "believe", "remember", "idea", "ideas",
            "feel", "feeling", "feelings", "excited", "interested",
            "curious", "happy", "proud", "worried", "scared", "confused",
            "reflect", "contemplate", "ponder", "analyze", "reason"
        },
        "color": cmap(2)
    },
    "Time / Duration": {
        "words": {
            "first", "second", "later", "since", "before", "after",
            "started", "start", "begin", "early", "late",
            "day", "days", "week", "weeks", "month", "months",
            "year", "years", "semester", "summer", "winter", "spring", "fall"
        },
        "color": cmap(3)
    },
    "Daily Activities & Work": {
        "words": {
            "work", "working", "teach", "teaching", "study", "studying",
            "read", "reading", "write", "writing", "talk", "talking",
            "meet", "meeting", "present", "presentation", "travel", "move",
            "build", "design", "make", "made", "fix", "support", "help",
            "eat", "sleep", "walk", "exercise", "commute", "cook", "clean",
            "shop", "plan", "organize", "schedule", "prepare", "attend"
        },
        "color": cmap(4)
    },
}  
    
    # Network Layout Parameters
    network_layout = "kamada-kawai"   # Options: "spring", "kamada-kawai", "circular", "shell"
    
    # Heatmap Parameters
    num_codes    = 7              # Expand up to this many codes for heatmap
    seed_codes   = ["background"] # Start from these seed codes (if None, top-N most frequent codes are used)
    clustered    = True           # If True, apply hierarchical clustering to rows/columns

    # Notes:
    # - If seed_codes is given → heatmap expands with top-N co-occurring codes. 
    # - If seed_codes is None   → heatmap uses top-N most frequent codes in corpus. 
    # - num_codes defines N in both cases.
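
To make the difference between the co-occurrence methods in the configuration notes more concrete, the sketch below computes a Jaccard index over context sets and a PMI score from context windows. It is a simplified illustration under stated assumptions (whitespace tokens, a symmetric window, PMI estimated from the share of windows that contain each word). The function names are hypothetical and the toolkit's own implementation may differ.

```python
import math
from collections import defaultdict

def context_sets(tokens, window_size):
    """Map each word to the set of words seen within window_size tokens of it."""
    contexts = defaultdict(set)
    for i, word in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

def jaccard(a, b):
    """Set overlap: shared context words divided by all context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pmi(windows, a, b):
    """log2 of p(a, b) / (p(a) * p(b)), with probabilities taken over windows."""
    n = len(windows)
    n_a = sum(a in w for w in windows)
    n_b = sum(b in w for w in windows)
    n_ab = sum((a in w) and (b in w) for w in windows)
    if n_ab == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((n_ab / n) / ((n_a / n) * (n_b / n)))

tokens = "students learn at school and teachers teach at school".split()
contexts = context_sets(tokens, window_size=2)
jaccard_score = jaccard(contexts["learn"], contexts["teach"])

windows = [{"school", "teacher"}, {"school", "teacher"},
           {"school", "family"}, {"family", "home"}]
pmi_score = pmi(windows, "school", "teacher")
print(round(jaccard_score, 3), round(pmi_score, 3))
```

Jaccard only asks whether two words share context words, while PMI asks whether a pair co-occurs more often than their individual frequencies would suggest. The "cosine" variants described in the notes compare whole context vectors instead of single scores.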

### 2. Word-based heatmap

In [None]:
print("\nSemantic Network Analysis Tool\n" + "=" * 30)

try:
    # Build pydantic object
    params = VisualsInput(
        filepath            = csv_path,
        stop_list           = stop_list_path if use_custom_stoplist else None,
        num_words           = num_words,
        clustering_method   = clustering_method,
        distance_metric     = distance_metric,
        window_size         = window_size,
        min_word_frequency  = min_word_frequency,
        projects            = projects,
        data_groups         = data_groups,
        codes               = codes,
        cross_pos_normalize = cross_pos_normalize
    )

        

    # Attach non-schema extras
    setattr(params, "reuse_clusterings",    reuse_clusterings)
    setattr(params, "seed_words",           seed_words)
    setattr(params, "custom_colors",        custom_colors)
    setattr(params, "semantic_categories",  semantic_categories)
    setattr(params, "link_threshold",       link_threshold)
    setattr(params, "link_color_threshold", link_color_threshold)
    setattr(params, "excluded_codes",       excluded_codes)    


    # Creates: (1) plain, (2) coloured, (3) coloured+seed-hidden
    run_heatmap_pipeline(params)


except ValidationError as ve:
    print("\n⚠ Parameter error:\n", ve)
except Exception as e:
    print(f"\n⚠ Unexpected error: {e}")
finally:
    print("\nAnalysis process completed.")


### 3. tSNE plot 

In [None]:
print("\nSemantic Network Analysis Tool\n" + "=" * 30)

try:
    # Build pydantic object
    params = VisualsInput(
        filepath            = csv_path,
        stop_list           = stop_list_path if use_custom_stoplist else None,
        num_words           = num_words,
        clustering_method   = clustering_method,
        distance_metric     = distance_metric,
        window_size         = window_size,
        min_word_frequency  = min_word_frequency,
        projects            = projects,
        data_groups         = data_groups,
        codes               = codes,
        cross_pos_normalize = cross_pos_normalize
    )

        

    # Attach non-schema extras
    setattr(params, "reuse_clusterings",    reuse_clusterings)
    setattr(params, "seed_words",           seed_words)
    setattr(params, "custom_colors",        custom_colors)
    setattr(params, "semantic_categories",  semantic_categories)
    setattr(params, "link_threshold",       link_threshold)
    setattr(params, "link_color_threshold", link_color_threshold)
    setattr(params, "excluded_codes",       excluded_codes)    


    # Creates: (1) plain, (2) coloured, (3) coloured+seed-hidden
    run_tsne_pipeline(params)


except ValidationError as ve:
    print("\n⚠ Parameter error:\n", ve)
except Exception as e:
    print(f"\n⚠ Unexpected error: {e}")
finally:
    print("\nAnalysis process completed.")


### 4. Word-based Heatmap and Social Network 
(no custom colors, just blue nodes and yellow seeds, no edge bundling)

In [None]:
print("\nSemantic Network Analysis Tool\n" + "=" * 30)

try:
    # Build pydantic object
    params = VisualsInput(
        filepath            = csv_path,
        stop_list           = stop_list_path if use_custom_stoplist else None,
        num_words           = num_words,
        clustering_method   = clustering_method,
        distance_metric     = distance_metric,
        window_size         = window_size,
        min_word_frequency  = min_word_frequency,
        projects            = projects,
        data_groups         = data_groups,
        codes               = codes,
        cross_pos_normalize = cross_pos_normalize
    )

        

    # Attach non-schema extras
    setattr(params, "reuse_clusterings",    reuse_clusterings)
    setattr(params, "seed_words",           seed_words)
    setattr(params, "custom_colors",        custom_colors)
    setattr(params, "semantic_categories",  semantic_categories)
    setattr(params, "link_threshold",       link_threshold)
    setattr(params, "link_color_threshold", link_color_threshold)
    setattr(params, "excluded_codes",       excluded_codes)    


    # Creates: (1) plain, (2) coloured, (3) coloured+seed-hidden
    run_heatmap_network_plain_pipeline(params)


except ValidationError as ve:
    print("\n⚠ Parameter error:\n", ve)
except Exception as e:
    print(f"\n⚠ Unexpected error: {e}")
finally:
    print("\nAnalysis process completed.")


### 5. Word-based Heatmap and Social Network 
(custom colors, no edge bundling)

In [None]:
print("\nSemantic Network Analysis Tool\n" + "=" * 30)

try:
    # Build pydantic object
    params = VisualsInput(
        filepath            = csv_path,
        stop_list           = stop_list_path if use_custom_stoplist else None,
        num_words           = num_words,
        clustering_method   = clustering_method,
        distance_metric     = distance_metric,
        window_size         = window_size,
        min_word_frequency  = min_word_frequency,
        projects            = projects,
        data_groups         = data_groups,
        codes               = codes,
        cross_pos_normalize = cross_pos_normalize
    )

        

    # Attach non-schema extras
    setattr(params, "reuse_clusterings",    reuse_clusterings)
    setattr(params, "seed_words",           seed_words)
    setattr(params, "custom_colors",        custom_colors)
    setattr(params, "semantic_categories",  semantic_categories)
    setattr(params, "link_threshold",       link_threshold)
    setattr(params, "link_color_threshold", link_color_threshold)
    setattr(params, "excluded_codes",       excluded_codes)    


    # Creates: (1) plain, (2) coloured, (3) coloured+seed-hidden
    run_heatmap_network_pipeline(params)


except ValidationError as ve:
    print("\n⚠ Parameter error:\n", ve)
except Exception as e:
    print(f"\n⚠ Unexpected error: {e}")
finally:
    print("\nAnalysis process completed.")
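
The role of `link_threshold` and `link_color_threshold` can be pictured as a filter over pairwise similarities. The following is a minimal sketch under two assumptions: pairs below `link_threshold` get no edge, and pairs at or above `link_color_threshold` are flagged as strong links for separate styling. The exact comparison rules and styling used by `run_heatmap_network_pipeline` are not shown in this notebook, so `build_edges` and the `strong` flag are hypothetical.

```python
def build_edges(words, similarity, link_threshold=0.50, link_color_threshold=0.75):
    """Return (word_a, word_b, score, strong) for every pair that passes the threshold."""
    edges = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            score = similarity[i][j]
            if score >= link_threshold:
                strong = score >= link_color_threshold
                edges.append((words[i], words[j], score, strong))
    return edges

words = ["school", "teacher", "family"]
similarity = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.6],
    [0.3, 0.6, 1.0],
]
edges = build_edges(words, similarity)
for edge in edges:
    print(edge)
```

Under these assumptions, setting `link_color_threshold` above the largest possible score (the configuration comment suggests 99) means no pair is flagged as strong.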


### 6. Code-based Heatmap

In [None]:
input_data = HeatmapInput(
    filepath    = CSV_PATH,
    num_codes   = num_codes,
    seed_codes  = seed_codes,
    projects    = projects,
    data_groups = data_groups,
    clustered   = clustered
)

# Run the heatmap generator
create_code_cooccurrence_heatmap(input_data)
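
The configuration notes on `seed_codes` and `num_codes` describe expanding from seed codes to the codes that most often co-occur with them. One way to picture that selection step is sketched below. `expand_seed_codes` is a hypothetical helper: counting `num_codes` as the total including the seeds and breaking ties alphabetically are assumptions made for this example, not documented behavior of `create_code_cooccurrence_heatmap`.

```python
from collections import Counter

def expand_seed_codes(entries, seed_codes, num_codes):
    """Return the seeds plus the codes that most often share an entry with a seed."""
    seeds = list(dict.fromkeys(seed_codes))  # drop duplicates, keep order
    partners = Counter()
    for codes in entries:
        codes = set(codes)
        if codes.intersection(seeds):
            partners.update(codes.difference(seeds))
    # Highest count first; alphabetical order breaks ties deterministically
    ranked = sorted(partners.items(), key=lambda item: (-item[1], item[0]))
    extra = [code for code, _ in ranked[: max(0, num_codes - len(seeds))]]
    return seeds + extra

entries = [
    ["background", "family"],
    ["background", "family", "school"],
    ["school", "career"],
    ["background", "career"],
]
selected = expand_seed_codes(entries, ["background"], num_codes=3)
print(selected)
```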

# Advanced Analytics Tools

### User Configuration for Advanced Visualization Tool

This section contains the advanced execution blocks, which combine the heatmap and network visualizations.

In [None]:
#---------------------------------------------------------------------------------
# Variables Preparation
#---------------------------------------------------------------------------------
"""
OVERVIEW:
----------------------
This script generates semantic networks based on the specified clustering method,
visualizing relationships between concepts in the text corpus. Heatmaps and t-SNE 
plots are also generated to help visualize the relationships.

USAGE INSTRUCTIONS:
----------------------
1. Configure the parameters in the CONFIG section below.
2. Save this notebook and run each block in order.
3. Alternatively, export the notebook to a Python script and
   execute it with: python semantic_network.py
4. The CONFIG values are the ones a user would tweak, or be prompted
   to enter in the interactive version.
"""

if __name__ == "__main__":

    # ========================= CONFIG =========================
    # Paths & Stop-list
    csv_path               = CSV_PATH # Your dataset path
    stop_list_path         = STOP_LIST_FILE
    use_custom_stoplist    = True

    # Core Analysis
    clustering_method = 2           # 1 = RoBERTa, 2 = Jaccard, 3 = PMI, 4 = TF-IDF
    distance_metric   = "cosine"   
    
    # Note on clustering method and distance metric:
    # - If clustering_method == 1 (RoBERTa), distance_metric is always "default" (ignored internally)
    # - For clustering_method in [2, 3, 4], distance_metric can be:
    #     "default" → uses raw co-occurrence or weighted scores
    #     "cosine"  → uses context or TF-IDF vectors with cosine similarity

    window_size            = 20 # Context window size for co-occurrence
    num_words              = 25 # Max number of top frequent words to analyze
    min_word_frequency     = 2 # Ignore words that appear fewer times
    reuse_clusterings      = False # Whether to reuse saved clustering results if available


    # Preprocessing Filters
    cross_pos_normalize    = True # Normalize words across parts of speech (e.g., "learn", "learning", "learned" -> "learn")
    projects               = ["oral_history"]     # Filter by project names 
    data_groups            = ["interview"]   # Filter by data_groups 
    codes                  = ["background"]        # Analyse specific codes
    excluded_codes         = ['interviewer']  # Exclude these codes, removing 'interviewer' is important for NLP

    # Visualisation
    title                = "Semantic Network (all interviews, contextual embeddings)"
    link_threshold       = 0.50          
    link_color_threshold = 0.75          # set to 99 to remove black links
    custom_colors = True                 

    # Seeds & Colours
    seed_words = "education: learning, teaching, student, school, classroom, curriculum, academic"

    # You can provide seed words in two formats:
    # 1. Grouped format (for custom node colors and labels):
    #       "GroupName1: word1, word2; GroupName2: word3, word4; ..."
    #    Example: 
    #       "Therapy: therapy, physical therapy; Caregivers: caregivers, caregiver; family"
    #    → 'family' will be treated as its own group, using its own color.
    #
    # 2. Flat list format (no grouping, default color):
    #       "dementia, therapy, caregivers"
    #    → All words will appear as individual highlighted nodes with default styling.
    
    # consider inductive readings, and analysis 
    cmap = plt.get_cmap("mako", 5)  # cm.get_cmap was removed in Matplotlib 3.9
    semantic_categories = {
    "People & Relations": {
        "words": {
            "people", "student", "students", "teacher", "teachers", "professor", "professors",
            "advisor", "mentor", "classmates", "friends", "colleague", "colleagues",
            "mother", "mom", "father", "dad", "family", "kids", "children",
            "wife", "husband", "brother", "sister"
        },
        "color": cmap(0)
    },
    "Education & Career": {
        "words": {
            "school", "college", "university", "department", "campus",
            "course", "courses", "class", "classes", "curriculum", "degree",
            "major", "minor", "graduate", "graduated", "thesis", "exam",
            "research", "lab", "laboratory", "project", "projects",
            "engineering", "engineer", "physics", "mathematics",
            "industry", "company", "career", "job", "internship"
        },
        "color": cmap(1)
    },
    "Emotions & Cognition": {
        "words": {
            "think", "thought", "know", "understand", "learn", "learning",
            "decide", "decision", "believe", "remember", "idea", "ideas",
            "feel", "feeling", "feelings", "excited", "interested",
            "curious", "happy", "proud", "worried", "scared", "confused",
            "reflect", "contemplate", "ponder", "analyze", "reason"
        },
        "color": cmap(2)
    },
    "Time / Duration": {
        "words": {
            "first", "second", "later", "since", "before", "after",
            "started", "start", "begin", "early", "late",
            "day", "days", "week", "weeks", "month", "months",
            "year", "years", "semester", "summer", "winter", "spring", "fall"
        },
        "color": cmap(3)
    },
    "Daily Activities & Work": {
        "words": {
            "work", "working", "teach", "teaching", "study", "studying",
            "read", "reading", "write", "writing", "talk", "talking",
            "meet", "meeting", "present", "presentation", "travel", "move",
            "build", "design", "make", "made", "fix", "support", "help",
            "eat", "sleep", "walk", "exercise", "commute", "cook", "clean",
            "shop", "plan", "organize", "schedule", "prepare", "attend"
        },
        "color": cmap(4)
    },
}

    # Network Layout Parameters
    network_layout = "kamada-kawai"   # Options: "spring", "kamada-kawai", "circular", "shell"
    
    # Heatmap Parameters
    num_codes    = 7              # Expand up to this many codes for heatmap
    seed_codes   = ["background"] # Start from these seed codes (if None, top-N most frequent codes are used)
    clustered    = True           # If True, apply hierarchical clustering to rows/columns

    # Notes:
    # - If seed_codes is given → heatmap expands with top-N co-occurring codes. 
    # - If seed_codes is None   → heatmap uses top-N most frequent codes in corpus. 
    # - num_codes defines N in both cases.
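
The two `seed_words` formats described in the configuration comments can be parsed with a few string splits. The sketch below is one possible reading of that format, not the toolkit's parser. `parse_seed_words` is a hypothetical name, and treating an ungrouped word as its own single-word group follows the 'family' example in the comments.

```python
def parse_seed_words(spec):
    """Parse "Group: w1, w2; Other: w3; w4" into {group_name: [words]}."""
    groups = {}
    for chunk in spec.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        if ":" in chunk:
            # Grouped format: "GroupName: word1, word2"
            name, words = chunk.split(":", 1)
            groups[name.strip()] = [w.strip() for w in words.split(",") if w.strip()]
        else:
            # Ungrouped words: each one becomes its own group
            for word in chunk.split(","):
                word = word.strip()
                if word:
                    groups[word] = [word]
    return groups

grouped = parse_seed_words(
    "Therapy: therapy, physical therapy; Caregivers: caregivers, caregiver; family"
)
flat = parse_seed_words("dementia, therapy, caregivers")
print(grouped)
print(flat)
```

Multi-word seeds such as "physical therapy" survive because the split happens on commas and semicolons only, never on spaces.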

## Full Visuals Version

Heatmap + tSNE + Social Semantic Network 

In [None]:
print("\nSemantic Network Analysis Tool\n" + "=" * 30)

try:
    # Build pydantic object
    params = VisualsInput(
        filepath            = csv_path,
        stop_list           = stop_list_path if use_custom_stoplist else None,
        num_words           = num_words,
        clustering_method   = clustering_method,
        distance_metric     = distance_metric,
        window_size         = window_size,
        min_word_frequency  = min_word_frequency,
        projects            = projects,
        data_groups         = data_groups,
        codes               = codes,
        cross_pos_normalize = cross_pos_normalize
    )

        

    # Attach non-schema extras
    setattr(params, "reuse_clusterings",    reuse_clusterings)
    setattr(params, "seed_words",           seed_words)
    setattr(params, "custom_colors",        custom_colors)
    setattr(params, "semantic_categories",  semantic_categories)
    setattr(params, "link_threshold",       link_threshold)
    setattr(params, "link_color_threshold", link_color_threshold)
    setattr(params, "excluded_codes",       excluded_codes)    


    # Creates: (1) plain, (2) coloured, (3) coloured+seed-hidden
    run_visuals_pipeline(params)


except ValidationError as ve:
    print("\n⚠ Parameter error:\n", ve)
except Exception as e:
    print(f"\n⚠ Unexpected error: {e}")
finally:
    print("\nAnalysis process completed.")
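
One way `link_threshold` and `link_color_threshold` might act on pairwise similarities is sketched below. The toolkit's drawing code is not shown in this section, so the interpretation is an assumption: pairs scoring at or above `link_threshold` become edges, and edges at or above `link_color_threshold` are flagged for colored styling. `build_edges` and the sample scores are hypothetical.

```python
def build_edges(similarity, link_threshold=0.5, link_color_threshold=0.75):
    """Turn pairwise similarities into styled edges (hypothetical helper).

    similarity: dict mapping (word_a, word_b) -> score in [0, 1]
    Returns a list of (word_a, word_b, score, style) tuples.
    """
    edges = []
    for (a, b), score in similarity.items():
        if a == b or score < link_threshold:
            continue  # skip self-links and weak links
        # Assumption: strong links get colored styling, the rest stay plain.
        style = "colored" if score >= link_color_threshold else "plain"
        edges.append((a, b, score, style))
    # Strongest links first, which makes the output easy to scan.
    return sorted(edges, key=lambda e: e[2], reverse=True)


similarity = {
    ("school", "teacher"): 0.82,
    ("school", "family"): 0.55,
    ("teacher", "summer"): 0.31,
}
for edge in build_edges(similarity):
    print(edge)
# ('school', 'teacher', 0.82, 'colored')
# ('school', 'family', 0.55, 'plain')
```

Under this reading, raising `link_threshold` thins the network, while raising `link_color_threshold` only changes which of the surviving links are emphasized.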


## Interactive Version 

In [None]:
def main():
    """
OVERVIEW:
----------------------
This tool allows you to generate a semantic network based on word embeddings 
or co-occurrence similarity. It supports multiple clustering methods (RoBERTa, 
Jaccard, PMI, TF-IDF), customizable filters, and dynamic seed word grouping.

USAGE INSTRUCTIONS:
----------------------

1. CSV File Selection:
   - On launch, the tool will ask whether to use the last CSV file you worked with.
   - If not, you can provide a new path or use the default.

2. Reuse Previous Parameters:
   - You’ll be asked whether to reuse previous analysis settings.
   - If 'y' and a config exists, it will skip all prompts and directly run using cached data.
   - If 'n' or no config exists, you'll go through the full interactive setup.

3. Stop Word List (Optional):
   - You can load a custom stop words file (one word per line).
   - If not provided, the default list will be used.

4. Clustering Method & Parameters:
   - Choose from: 1 = RoBERTa, 2 = Jaccard, 3 = PMI, 4 = TF-IDF.
   - For methods 2–4, specify distance metric (default or cosine) and context window size.

5. Analysis Filters:
   - Specify the number of words to include, minimum frequency, and metadata filters
     (projects, data groups, codes).

6. Seed Words:
   - You may enter structured seed groups (e.g. "Grief: grief, sorrow; Memory: memory, recall")
     or leave blank to use the default.

7. Output:
   - The tool generates:
     - A semantic network (with and without seed nodes)
     - A similarity heatmap
     - A t-SNE plot
   - All visualizations are saved to the `OUTPUT_DIR` directory.

Note:
- The system caches results based on parameter combinations and data hash.
- Settings are saved to 'last_run_config.json' and CSV path to 'last_csv_path.txt'.

"""
    try:
        print("Semantic Network Analysis Tool")
        print("==============================")

        # --- Default configuration ---
        default_csv_path           = CSV_PATH
        default_stop_list_path     = STOP_LIST_FILE
        default_window_size        = 20
        default_num_words          = 25
        default_min_word_frequency = 2
        default_excluded_codes     = ['interviewer']  # Codes to exclude; excluding 'interviewer' is important for NLP
        default_seed_words         = "education: learning, teaching, student, school, classroom, curriculum, academic"
        cmap = cm.get_cmap("mako", 5)
        default_semantic_categories = {
    "People & Relations": {
        "words": {
            "people", "student", "students", "teacher", "teachers", "professor", "professors",
            "advisor", "mentor", "classmates", "friends", "colleague", "colleagues",
            "mother", "mom", "father", "dad", "family", "kids", "children",
            "wife", "husband", "brother", "sister"
        },
        "color": cmap(0)
    },
    "Education & Career": {
        "words": {
            "school", "college", "university", "department", "campus",
            "course", "courses", "class", "classes", "curriculum", "degree",
            "major", "minor", "graduate", "graduated", "thesis", "exam",
            "research", "lab", "laboratory", "project", "projects",
            "engineering", "engineer", "physics", "mathematics",
            "industry", "company", "career", "job", "internship"
        },
        "color": cmap(1)
    },
    "Emotions & Cognition": {
        "words": {
            "think", "thought", "know", "understand", "learn", "learning",
            "decide", "decision", "believe", "remember", "idea", "ideas",
            "feel", "feeling", "feelings", "excited", "interested",
            "curious", "happy", "proud", "worried", "scared", "confused"
        },
        "color": cmap(2)
    },
    "Time / Duration": {
        "words": {
            "first", "second", "later", "since", "before", "after",
            "started", "start", "begin", "early", "late",
            "day", "days", "week", "weeks", "month", "months",
            "year", "years", "semester", "summer", "winter", "spring", "fall"
        },
        "color": cmap(3)
    },
    "Daily Activities & Work": {
        "words": {
            "work", "working", "teach", "teaching", "study", "studying",
            "read", "reading", "write", "writing", "talk", "talking",
            "meet", "meeting", "present", "presentation", "travel", "move",
            "build", "design", "make", "made", "fix", "support", "help"
        },
        "color": cmap(4)
    },
}

        default_title  = "Semantic Network: (contextual embeddings)"
        default_link_threshold     = 0.5
        default_link_color_thresh  = 0.75
        
        # Check for previous CSV file
        previous_csv_path = ""
        if os.path.exists(LAST_CSV_PATH):
            try:
                with open(LAST_CSV_PATH, 'r') as f:
                    previous_csv_path = f.read().strip()
                if os.path.exists(previous_csv_path):
                    use_last = ask_yes_no(f"Use previous CSV file? ({previous_csv_path})", default=False)
                    if use_last:
                        csv_path = previous_csv_path
                    else:
                        csv_path = ask_path(f"Enter CSV file path", default_csv_path)
                else:
                    print(f"Previous CSV file not found: {previous_csv_path}")
                    csv_path = ask_path(f"Enter CSV file path", default_csv_path)
            except Exception as e:
                print(f"Error reading previous CSV path: {e}")
                csv_path = ask_path(f"Enter CSV file path", default_csv_path)
        else:
            csv_path = ask_path(f"Enter CSV file path", default_csv_path)

        
        # Validate CSV path
        if not os.path.exists(csv_path):
            print(f"Error: File {csv_path} not found")
            return  # Nothing to analyze without a valid CSV file

        # Save the current CSV path for next time
        try:
            with open(LAST_CSV_PATH, 'w') as f:
                f.write(csv_path)
            print(f"Saved current CSV path for future use: {csv_path}")
        except Exception as e:
            print(f"Warning: Could not save CSV path for future use: {e}")

        # Ask about stop words
        use_stop_list = ask_yes_no(f"Use a custom stop-words file? (default path shown: {default_stop_list_path})", default=False)
        stop_list_path = None
        if use_stop_list:
            stop_list_path = ask_path("Enter STOP words file path", default_stop_list_path)
            # Validate stop list path
            if not stop_list_path or not os.path.exists(stop_list_path):
                print(f"Warning: Stop words file {stop_list_path} not found. Using default stop words only.")
                stop_list_path = None
            else:
                # Debug stoplist
                with open(stop_list_path, 'r') as f:
                    custom_stopwords = [line.strip() for line in f if line.strip()]
                print(f"Loaded {len(custom_stopwords)} custom stopwords")
                print(f"First 10 stopwords: {custom_stopwords[:10]}")
                print(f"Stoplist file size: {os.path.getsize(stop_list_path)} bytes")

        reuse_clusterings = ask_yes_no("Reuse previous clusterings and parameters?", default=True)

        # Reuse cached parameters only when a saved config actually exists;
        # otherwise fall through to the interactive setup below.
        if reuse_clusterings and os.path.exists(LAST_CONFIG_PATH):
            print("Loading previous parameters...")
            with open(LAST_CONFIG_PATH, 'r') as f:
                saved_params = json.load(f)
            # Reconstruct VisualsInput from saved dictionary
            params = VisualsInput(**saved_params)
            setattr(params, "custom_colors", True)
            setattr(params, "link_threshold", default_link_threshold)
            setattr(params, "link_color_threshold", default_link_color_thresh)
            run_visuals_pipeline(params)
            return  # Exit early since pipeline is already run
        else:
            if reuse_clusterings:
                print("No saved parameter file found. Proceeding to manual input...")
            while True:
                try:
                    clustering_method = int(
                        input("Choose clustering method 1=RoBERTa, 2=Jaccard, 3=PMI, 4=TF-IDF [Default 1]: ").strip() or "1"
                        )
                    if clustering_method in (1, 2, 3, 4):
                        break
                    print("Please enter 1-4")
                except ValueError:
                    print("Please enter valid number")

            # distance metric & window
            distance_metric = "default"
            window_size = default_window_size
            if clustering_method in (2, 3, 4):
                metric_choice = input("Distance metric 1=Default, 2=Cosine [Default 1]: ").strip()
                distance_metric = "cosine" if metric_choice == "2" else "default"
                while True:
                    try:
                        window_size = int(
                            ask_path(f"Window size for co-occurrence", default_window_size)
                            )
                        if window_size > 0:
                            break
                        print("Window size must be positive.")
                    except ValueError:
                        print("Please enter valid number")

            # other analysis params
            num_words = int(ask_path(f"How many words to analyze?", default_num_words))
            min_word_frequency = int(ask_path(f"Min word frequency", default_min_word_frequency))
            cross_pos_normalize = ask_yes_no("Normalize words across POS?", default=True)
            seed_input = ask_path("Seed words (blank for default): ", default_seed_words)

            # filters
            projects_input = input("Projects, comma-separated (press Enter to skip): ").strip()
            projects = [p.strip() for p in projects_input.split(",") if p.strip()] if projects_input else None
            data_groups_input = input("Data groups, comma-separated (press Enter to skip): ").strip()
            data_groups = [g.strip() for g in data_groups_input.split(",") if g.strip()] if data_groups_input else None
            codes_input = input("Codes, comma-separated (press Enter to skip): ").strip()
            codes = [c.strip() for c in codes_input.split(",") if c.strip()] if codes_input else None
            excluded_codes_input = input("Codes to exclude from analysis, comma-separated (press Enter to keep the default): ").strip()
            excluded_codes = [e.strip() for e in excluded_codes_input.split(",") if e.strip()] if excluded_codes_input else default_excluded_codes

            # Optional semantic colour groups (useful for inductive readings and analysis)
            print(
                "\n▶ OPTIONAL — define semantic colour groups.\n"
                "  Format per group:  GroupName:{'color':'#HEX','words':[w1,w2]}\n"
                "  Separate multiple groups with a semicolon ‘;’ on one line.\n"
                "  Example:\n"
                "    Health:{'color':'#F8961E','words':['health','clinician']};\n"
                "    Roles :{'color':'#43AA8B','words':['parent','caregiver']}\n"
                "  Press <Enter> to keep the default shown below.\n"
                f"  Default = {default_semantic_categories}\n"
            )

            raw_sc = input("semantic_categories → ").strip()

            if not raw_sc:                                   # user kept the defaults
                semantic_categories = default_semantic_categories.copy()

            else:
                try:
                    semantic_categories = {}
                    # allow several groups separated by ';'
                    for block in filter(None, map(str.strip, raw_sc.split(';'))):
                        if ':' not in block:
                            raise ValueError(f"missing ':' in «{block}»")

                        label, dict_part = map(str.strip, block.split(':', 1))

                        # try JSON first, then Python literal
                        try:
                            cat_dict = json.loads(dict_part)
                        except json.JSONDecodeError:
                            cat_dict = ast.literal_eval(dict_part)

                        # minimal validation
                        if not isinstance(cat_dict, dict) or \
                        'color' not in cat_dict or 'words' not in cat_dict:
                            raise ValueError(
                                f"Group ‘{label}’ must contain 'color' and 'words' keys"
                            )

                        # normalise words to lower-case for matching
                        cat_dict['words'] = [w.lower() for w in cat_dict['words']]
                        semantic_categories[label] = cat_dict

                    if not semantic_categories:
                        raise ValueError("no valid groups parsed")

                except Exception as err:
                    print("⚠  Could not parse your custom groups. "
                          "Falling back to the default set.\n", err)
                    semantic_categories = default_semantic_categories.copy()

        print("\nRunning analysis...\n")

        params = VisualsInput(
            filepath=csv_path,
            stop_list=stop_list_path,
            num_words=num_words,
            clustering_method=clustering_method,
            distance_metric=distance_metric,
            window_size=window_size,
            min_word_frequency=min_word_frequency,
            projects=projects,
            data_groups= data_groups,
            codes=codes,
            cross_pos_normalize=cross_pos_normalize
        )

        setattr(params, "reuse_clusterings", reuse_clusterings)
        setattr(params, "seed_words", seed_input)
        setattr(params, "custom_colors", True)
        setattr(params, "semantic_categories", semantic_categories)
        setattr(params, "link_threshold", default_link_threshold)
        setattr(params, "link_color_threshold", default_link_color_thresh)
        setattr(params, "excluded_codes", excluded_codes)

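        try:
            # Serialize before opening the file so a failure cannot leave a
            # half-written config; default=list converts the word sets in
            # semantic_categories, which json cannot encode directly.
            config_json = json.dumps(params.__dict__, indent=2, default=list)
            with open(LAST_CONFIG_PATH, 'w') as f:
                f.write(config_json)
            print(f"Parameters saved to {LAST_CONFIG_PATH}")
        except Exception as e:
            print(f"Warning: Failed to save parameters: {e}")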

        run_visuals_pipeline(params)

    except ValidationError as ve:
        print("\n⚠ Parameter validation error:\n", ve)
    except Exception as e:
        print(f"\n⚠ Unexpected error: {e}")
    finally:
        print("\nAnalysis process completed.")

if __name__ == "__main__":
    main()
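
The structured seed format described in the docstring (`"Grief: grief, sorrow; Memory: memory, recall"`) could be parsed along these lines. `parse_seed_groups` is a hypothetical helper written for illustration; the pipeline's own parser is not shown here and may treat edge cases differently. The sketch assumes that a block without a label is collected under a default `"seed"` group and that words are matched in lower case.

```python
def parse_seed_groups(raw, default_label="seed"):
    """Parse 'Label: w1, w2; Label2: w3' into {label: [words]} (hypothetical)."""
    groups = {}
    for block in raw.split(";"):
        block = block.strip()
        if not block:
            continue  # tolerate trailing or doubled semicolons
        if ":" in block:
            label, words_part = block.split(":", 1)
            label = label.strip() or default_label
        else:
            # Assumption: a block without a label goes into a default group.
            label, words_part = default_label, block
        words = [w.strip().lower() for w in words_part.split(",") if w.strip()]
        if words:
            groups.setdefault(label, []).extend(words)
    return groups


print(parse_seed_groups("Grief: grief, sorrow; Memory: memory, recall"))
# {'Grief': ['grief', 'sorrow'], 'Memory': ['memory', 'recall']}
```

Returning a plain dictionary keeps the group labels available for legends and colouring while leaving the word lists easy to flatten when only the seed vocabulary is needed.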