# Testing Toponymy Audit Functionality

This notebook demonstrates how to use the new audit functionality in Toponymy to compare intermediate results (keyphrases, exemplars, subtopics) with final LLM-generated topic names.

**Important**: Make sure to select the "Python 3.10 (toponymy-test)" kernel to run this notebook with all dependencies installed.

In [43]:
# Import required libraries
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# Import Toponymy and the new audit functions
from toponymy import Toponymy, ToponymyClusterer
from toponymy.audit import (
    create_audit_df,
    create_comparison_df,
    create_keyphrase_analysis_df,
    create_layer_summary_df,
    create_prompt_analysis_df,
    export_audit_excel,
    get_cluster_details,
    get_cluster_documents  # New function added
)

## 1. Load Sample Data

We'll use the 20-newsgroups dataset as shown in the README.

In [44]:
# Load the 20-newsgroups dataset with precomputed embeddings
print("Loading 20-newsgroups dataset...")
newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")

# Extract text, vectors, and map
text = newsgroups_df["post"].str.strip().values
document_vectors = np.stack(newsgroups_df["embedding"].values)
document_map = np.stack(newsgroups_df["map"].values)

print(f"Loaded {len(text)} documents")
print(f"Document vectors shape: {document_vectors.shape}")
print(f"Document map shape: {document_map.shape}")

Loading 20-newsgroups dataset...
Loaded 18170 documents
Document vectors shape: (18170, 768)
Document map shape: (18170, 2)


## 2. Initialize Models

Set up the embedding model and LLM wrapper.

In [None]:
# Initialize embedding model
print("Loading embedding model...")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize OpenAI LLM wrapper
import os

# Load the API key from an environment variable (recommended for security)
# You can create a key at: https://platform.openai.com/api-keys
openai_api_key = os.getenv("OPENAI_API_KEY")

# Alternative: hard-code the key (not recommended)
# openai_api_key = "sk-YOUR-API-KEY-HERE"

# Alternative: load from a file
# with open("openai_key.txt", "r") as f:
#     openai_api_key = f.read().strip()

# Import the OpenAI wrapper from toponymy.llm_wrappers
from toponymy.llm_wrappers import OpenAINamer

# Create the OpenAI wrapper
# gpt-4o-mini is a cost-effective choice for topic naming
llm = OpenAINamer(api_key=openai_api_key, model="gpt-4o-mini")

print("OpenAI model initialized!")
print(f"Using model: {llm.model}")
print("Ready to generate topic names using OpenAI API")

Loading embedding model...
OpenAI model initialized!
Using model: gpt-4o-mini
Ready to generate topic names using OpenAI API


## 3. Create and Fit Toponymy Model

We'll use a smaller subset for faster testing.

In [47]:
# Use a subset for faster testing
subset_size = 3000
text_subset = text[:subset_size]
vectors_subset = document_vectors[:subset_size]
map_subset = document_map[:subset_size]

print(f"Using subset of {subset_size} documents for testing")

Using subset of 3000 documents for testing


In [48]:
# Estimate API costs before running
# GPT-4o-mini pricing (as of 2024): $0.15 per 1M input tokens, $0.60 per 1M output tokens

# Rough estimation
estimated_clusters = 20  # Approximate number of clusters across all layers
estimated_tokens_per_prompt = 1000  # Each prompt with keyphrases and exemplars
estimated_output_tokens = 20  # Topic names are short

total_input_tokens = estimated_clusters * estimated_tokens_per_prompt
total_output_tokens = estimated_clusters * estimated_output_tokens

# Calculate costs (prices in USD)
input_cost = (total_input_tokens / 1_000_000) * 0.15
output_cost = (total_output_tokens / 1_000_000) * 0.60
total_cost = input_cost + output_cost

print(f"Estimated OpenAI API costs for this test:")
print(f"  Input tokens: ~{total_input_tokens:,} (${input_cost:.4f})")
print(f"  Output tokens: ~{total_output_tokens:,} (${output_cost:.4f})")
print(f"  Total estimated cost: ${total_cost:.4f}")
print(f"\nNote: Actual costs may vary. Using subset_size={subset_size} documents")

Estimated OpenAI API costs for this test:
  Input tokens: ~20,000 ($0.0030)
  Output tokens: ~400 ($0.0002)
  Total estimated cost: $0.0032

Note: Actual costs may vary. Using subset_size=3000 documents


In [49]:
# Initialize clusterer
# ToponymyClusterer has no max_clusters parameter; tune min_clusters and base_min_cluster_size instead
clusterer = ToponymyClusterer(min_clusters=3, base_min_cluster_size=10)

# Initialize Toponymy
topic_model = Toponymy(
    llm_wrapper=llm,
    text_embedding_model=embedding_model,
    clusterer=clusterer,
    object_description="newsgroup posts",
    corpus_description="20-newsgroups dataset",
    exemplar_delimiters=["<EXAMPLE_POST>\n", "\n</EXAMPLE_POST>\n\n"],
)

print("Fitting Toponymy model...")
print("This will make API calls to OpenAI - costs will apply!")
topic_model.fit(text_subset, vectors_subset, map_subset)
print("Model fitted!")

Fitting Toponymy model...
This will make API calls to OpenAI - costs will apply!
Layer 0 found 80 clusters
Layer 1 found 26 clusters
Layer 2 found 10 clusters
Layer 3 found 4 clusters




Model fitted!





In [50]:
# Show basic model information
print(f"Number of layers: {len(topic_model.cluster_layers_)}")
for i, layer in enumerate(topic_model.cluster_layers_):
    labels = np.asarray(layer.cluster_labels)
    n_clusters = len(np.unique(labels[labels >= 0]))  # Exclude the noise label (-1), if present
    print(f"Layer {i}: {n_clusters} clusters")

Number of layers: 4
Layer 0: 80 clusters
Layer 1: 26 clusters
Layer 2: 10 clusters
Layer 3: 4 clusters
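
The cost estimate above assumed roughly 20 clusters, but the fit reports 80 + 26 + 10 + 4 = 120 clusters. The sketch below reruns the same arithmetic with the observed counts. It keeps the notebook's assumed 1,000 input tokens and 20 output tokens per prompt and the GPT-4o-mini prices quoted earlier, and it assumes one LLM call per cluster, which is likely an upper bound. `estimate_cost` is a hypothetical helper, not part of Toponymy.

```python
def estimate_cost(
    n_prompts,
    input_tokens_per_prompt=1000,    # assumption carried over from the estimate cell
    output_tokens_per_prompt=20,     # assumption carried over from the estimate cell
    input_price_per_million=0.15,    # USD, as quoted earlier in this notebook
    output_price_per_million=0.60,   # USD, as quoted earlier in this notebook
):
    """Return a rough total cost in USD for n_prompts LLM calls."""
    input_cost = n_prompts * input_tokens_per_prompt / 1_000_000 * input_price_per_million
    output_cost = n_prompts * output_tokens_per_prompt / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Cluster counts per layer as reported by the fit above
clusters_per_layer = [80, 26, 10, 4]
n_prompts = sum(clusters_per_layer)

print(f"{n_prompts} prompts, estimated cost: ${estimate_cost(n_prompts):.4f}")
```

Under these assumptions the estimate scales linearly with the number of clusters, so it is about six times the original figure. Actual prompts may be considerably longer than 1,000 tokens, so treat the result as a rough order of magnitude.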


## 4. Test Audit Functions

Now let's test the various audit functions to see intermediate vs final results.

### 4.1 Layer Summary

In [51]:
# Get overall layer summary
layer_summary = create_layer_summary_df(topic_model)
print("Layer Summary:")
layer_summary

Layer Summary:


Unnamed: 0,layer,num_clusters,avg_cluster_size,min_cluster_size,max_cluster_size,unique_topic_names,duplicate_topic_names,has_subtopics
0,0,80,21.925,10,75,80,0,True
1,1,26,72.307692,36,146,26,0,True
2,2,10,192.1,118,371,10,0,True
3,3,4,579.75,295,1354,4,0,True


### 4.2 Comparison View - Intermediate vs Final Results

In [52]:
# Show comparison for layer 1 (each cluster has child subtopics from layer 0)
comparison_df = create_comparison_df(topic_model, layer_index=1)
print("\nComparison of Intermediate vs LLM Results (Layer 1, first 10 clusters):")
comparison_df.head(10)


Comparison of Intermediate vs LLM Results (Layer 1, first 10 clusters):


Unnamed: 0,Cluster ID,Document Count,Extracted Keyphrases (Top 5),Exemplar Count,Child Subtopics,Final LLM Topic Name
0,0,121,"hockey, game, team, playoffs, players",8,"Comprehensive Analysis of NHL Game Outcomes, S...",NHL Game Strategies and Player Analysis
1,1,45,"x-soviet armenian government, turkish, muslim ...",8,Armenian-Turkish Conflict and Historical Revis...,Armenian-Turkish Conflict and Historical Revis...
2,2,126,"patients, chronic, sci med, msg, intellect and...",8,Health Effects of MSG and Glutamate in Food Co...,Chronic Health Issues and Dietary Effects
3,3,47,"braves, baseball, catcher, hitter, pinch hit",8,In-Depth Analysis of Phillies and Mariners Pla...,In-Depth Analysis of Phillies and Mariners Pla...
4,4,37,"stats, total baseball, alomar, career, swing",8,Comprehensive Evaluation of MLB Player Metrics...,In-depth Analysis of Baseball Player Statistics
5,5,146,"encryption scheme, clipper chip, public key, k...",8,Mailing List Management and FAQ Discussions in...,Encryption Controversies and Government Survei...
6,6,40,"pulse dialing, box and faceplate screw, ground...",8,Telecommunications Equipment and Dialing Techn...,Telecommunications Equipment and Dialing Techn...
7,7,78,"state of israel, arabs, palestinians, jews, ce...",8,Israeli-Palestinian Conflict and Historical Qu...,Israeli-Palestinian Conflict and Holocaust Dis...
8,8,36,"orbit, spacecraft, solar, jupiter, pluto",8,Planetary Exploration Missions and Spacecraft ...,Planetary Exploration Missions and Spacecraft ...
9,9,48,"space station redesign, moon, space shuttle, s...",8,NASA Space Station Redesign and Lunar Mission ...,NASA Space Station and Astronomy Challenges


In [53]:
# Show comparison for a higher layer (more general topics)
if len(topic_model.cluster_layers_) > 2:
    comparison_df_layer2 = create_comparison_df(topic_model, layer_index=2)
    print("\nComparison for Layer 2 (broader topics):")
    display(comparison_df_layer2.head(10))


Comparison for Layer 2 (broader topics):


Unnamed: 0,Cluster ID,Document Count,Extracted Keyphrases (Top 5),Exemplar Count,Child Subtopics,Final LLM Topic Name
0,0,140,"baseball, hitter, hit, braves, players",8,In-Depth Analysis of Phillies and Mariners Pla...,Baseball Player Statistics Analysis
1,1,121,"hockey, team, playoffs, maple leafs, players",8,"Comprehensive Analysis of NHL Game Outcomes, S...",NHL Game Strategies and Player Analysis
2,2,126,"medical, chronic, treatment, intellect and geb...",8,Health Effects of MSG and Glutamate in Food Co...,Chronic Health Issues and Dietary Effects
3,3,191,"encryption scheme, clipper chip, security, alg...",8,Telecommunications Equipment and Dialing Techn...,Encryption and Surveillance Issues
4,4,118,"space station, moon, orbit, solar, mission",8,Planetary Exploration Missions and Spacecraft ...,Space Exploration and Missions
5,5,285,"car, bike, oil, engine, honda",8,Chemical Solutions and Practical Applications ...,Automotive and Motorcycling Discussions
6,6,371,"video card, ram, windows, dos, local bus",8,"Comic Books, CDs, and Movie Merchandise for Sa...",Computer Hardware Discussions
7,7,295,"window manager, lib x11, motif, server, widget",8,"Discussions on Windows NT, OS/2, and Applicati...",X11 Graphics and Window Management
8,8,123,"god, bible, sin, christians, heaven",8,Christian Perspectives on Homosexuality and Sc...,Theological Debates on Biblical Interpretation
9,9,151,"existence of god, alt atheism, faith, truth, b...",8,"Atheism, Religion, and Their Influence on Musi...",Atheism and Religious Debate


### 4.3 Detailed Audit DataFrame

In [54]:
# Get detailed audit for layer 0
audit_df = create_audit_df(topic_model, layer_index=0)

# Show selected columns for first few clusters
print("\nDetailed Audit Information:")
audit_df[['cluster_id', 'num_documents', 'num_keyphrases', 'num_exemplars', 
          'top_5_keyphrases', 'llm_topic_name']].head()


Detailed Audit Information:


Unnamed: 0,cluster_id,num_documents,num_keyphrases,num_exemplars,top_5_keyphrases,llm_topic_name
0,0,25,16,8,"espn bought, zones abc, hockey games, coverage...",NHL Broadcast Coverage and ESPN Scheduling Dis...
1,1,18,16,8,"det, nhl, saves, 38, stl","Comprehensive Analysis of NHL Game Outcomes, S..."
2,2,25,16,8,"win game, hawks power, blues, belfour, penalty",In-Depth Discussions on NHL Game Strategies an...
3,3,10,16,8,"copy protection schemes, pirates, progs, new e...",Discussion on Software Copy Protection and Cir...
4,4,10,16,8,"goalies, tommy soderstrom, curtis joseph, havi...",Discussions on Goalie Equipment and Masks in H...


In [55]:
# Look at a specific cluster in detail
cluster_to_inspect = 0
cluster_audit = audit_df[audit_df['cluster_id'] == cluster_to_inspect].iloc[0]

print(f"\nDetailed view of Cluster {cluster_to_inspect}:")
print(f"Number of documents: {cluster_audit['num_documents']}")
print(f"\nKeyphrases extracted: {cluster_audit['num_keyphrases']}")
print(f"Top 5 keyphrases: {cluster_audit['top_5_keyphrases']}")
print(f"\nNumber of exemplars: {cluster_audit['num_exemplars']}")
print(f"\nFirst exemplar preview:")
print(cluster_audit['first_exemplar'])
print(f"\nFinal LLM topic name: '{cluster_audit['llm_topic_name']}'")


Detailed view of Cluster 0:
Number of documents: 25

Keyphrases extracted: 16
Top 5 keyphrases: espn bought, zones abc, hockey games, coverage sunday april, cbc

Number of exemplars: 8

First exemplar preview:
Oh to be back in the good old days when I lived in Florida (Florida for
Petes sake!!) and could watch hockey every night as ESPN and USA alternated
coverage nights. Oh well I guess it would be too simple for the home office
to look back into their past to solve a problem in the present...

Of course...

Final LLM topic name: 'NHL Broadcast Coverage and ESPN Scheduling Discussions'


### 4.4 Keyphrase Analysis

In [56]:
# Analyze how keyphrases relate to topic names
keyphrase_df = create_keyphrase_analysis_df(topic_model, layer_index=0)

# Show some examples
print("\nKeyphrase to Topic Name Mapping (first 20):")
display(keyphrase_df.head(20))

# Summary statistics
keyphrase_usage = keyphrase_df['keyphrase_in_topic'].value_counts()
print(f"\nKeyphrase usage in topic names:")
print(f"Keyphrases appearing in topic names: {keyphrase_usage.get(True, 0)}")
print(f"Keyphrases NOT in topic names: {keyphrase_usage.get(False, 0)}")
print(f"Percentage of keyphrases used: {keyphrase_usage.get(True, 0) / len(keyphrase_df) * 100:.1f}%")


Keyphrase to Topic Name Mapping (first 20):


Unnamed: 0,cluster_id,keyphrase,llm_topic_name,keyphrase_in_topic
0,0,espn bought,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
1,0,zones abc,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
2,0,hockey games,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
3,0,coverage sunday april,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
4,0,cbc,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
5,0,pm mdt,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
6,0,nationwide tsn,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
7,0,playoffs,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
8,0,satellite dish,NHL Broadcast Coverage and ESPN Scheduling Dis...,False
9,0,gary thorne,NHL Broadcast Coverage and ESPN Scheduling Dis...,False



Keyphrase usage in topic names:
Keyphrases appearing in topic names: 66
Keyphrases NOT in topic names: 734
Percentage of keyphrases used: 8.2%


### 4.5 Prompt Analysis

In [57]:
# Analyze prompts sent to LLM
prompt_df = create_prompt_analysis_df(topic_model)

print("\nPrompt Analysis:")
display(prompt_df.head(10))

# Summary statistics
print("\nPrompt Statistics:")
print(f"Average prompt length: {prompt_df['prompt_length'].mean():.0f} characters")
print(f"Min prompt length: {prompt_df['prompt_length'].min()}")
print(f"Max prompt length: {prompt_df['prompt_length'].max()}")
print(f"\nAverage topic name length: {prompt_df['topic_name_length'].mean():.1f} characters")


Prompt Analysis:


Unnamed: 0,layer,cluster_id,prompt_length,num_exemplars_in_prompt,num_keyphrases_in_prompt,topic_name,topic_name_length
0,0,0,6010,16,10,NHL Broadcast Coverage and ESPN Scheduling Dis...,54
1,0,1,15330,16,10,"Comprehensive Analysis of NHL Game Outcomes, S...",74
2,0,2,7429,16,10,In-Depth Discussions on NHL Game Strategies an...,75
3,0,3,11481,16,10,Discussion on Software Copy Protection and Cir...,67
4,0,4,4353,16,10,Discussions on Goalie Equipment and Masks in H...,51
5,0,5,7673,16,10,NHL European Player Representation and Managem...,58
6,0,6,12104,16,10,NCAA Hockey Discussions and Coaching Changes i...,66
7,0,7,6773,16,10,"NHL Trades, Strategies, and Player Prospects D...",56
8,0,8,13293,16,10,"Youth Drug Use, Health Insurance, and Poverty ...",64
9,0,9,8161,16,10,Engineering Challenges in Fossil Fuel Plants a...,77



Prompt Statistics:
Average prompt length: 9864 characters
Min prompt length: 39
Max prompt length: 33518

Average topic name length: 56.6 characters
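
A minimum prompt length of 39 characters is far shorter than the system prompt alone, which suggests some entries are placeholders rather than real LLM calls. The `prompt_preview` column shown in section 7.2 contains entries beginning with `[!SKIP!]`, so the sketch below separates those before computing statistics. Treating `[!SKIP!]` as a "no LLM call" marker is an assumption based on that output, and the sample prompts are made up for illustration.

```python
from statistics import mean

SKIP_MARKER = "[!SKIP!]"  # assumption: prefix seen in the prompt_preview column


def split_prompts(prompts):
    """Split prompts into (real, skipped) based on the skip marker prefix."""
    real, skipped = [], []
    for prompt in prompts:
        # str() lets this work for both plain strings and dict-style prompts
        if str(prompt).startswith(SKIP_MARKER):
            skipped.append(prompt)
        else:
            real.append(prompt)
    return real, skipped


# Illustrative prompts only, not taken from the fitted model
prompts = [
    "[!SKIP!]: Some Child Topic Name",
    "You are an expert at classifying posts into topics. " * 20,
    "You are an expert at classifying posts into topics. " * 50,
]

real, skipped = split_prompts(prompts)
print(f"Real prompts: {len(real)}, skipped: {len(skipped)}")
print(f"Average length of real prompts: {mean(len(str(p)) for p in real):.0f} characters")
```

Excluding placeholder entries keeps the minimum and average prompt lengths representative of the prompts that were actually sent.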


### 4.6 Get Full Details for a Specific Cluster

In [58]:
# Get complete details for a specific cluster including the actual prompt
cluster_details = get_cluster_details(topic_model, layer_index=0, cluster_id=0)

print("\nComplete Cluster Details:")
print(f"Layer: {cluster_details['layer']}")
print(f"Cluster ID: {cluster_details['cluster_id']}")
print(f"Number of documents: {cluster_details['num_documents']}")
print(f"Topic name: {cluster_details['topic_name']}")

print(f"\nKeyphrases (top 10): {cluster_details['keyphrases'][:10]}")
print(f"\nNumber of exemplars: {len(cluster_details['exemplars'])}")

if 'prompt' in cluster_details:
    print(f"\nPrompt sent to LLM (first 1000 chars):")
    print(str(cluster_details['prompt'])[:1000] + "...")


Complete Cluster Details:
Layer: 0
Cluster ID: 0
Number of documents: 25
Topic name: NHL Broadcast Coverage and ESPN Scheduling Discussions

Keyphrases (top 10): ['espn bought', 'zones abc', 'hockey games', 'coverage sunday april', 'cbc', 'pm mdt', 'nationwide tsn', 'playoffs', 'satellite dish', 'gary thorne']

Number of exemplars: 8

Prompt sent to LLM (first 1000 chars):
{'system': '\nYou are an expert at classifying newsgroup posts from 20-newsgroups dataset into topics.\nYour task is to analyze information about a group of newsgroup posts and assign a domain expert level (8 to 15 word) name to this group.\nThe response must be in JSON formatted as {"topic_name":<NAME>, "topic_specificity":<SCORE>}\nwhere NAME is the topic name you generate and SCORE is a float value between 0.0 and 1.0,\nrepresenting how specific and well-defined the topic name is given the input information.\nA score of 1.0 means a perfectly descriptive and specific name, while 0.0 would be a completely generic o

### 4.7 Export to Excel

In [59]:
# # Export all audit data to Excel file
# export_audit_excel(topic_model, "toponymy_audit_report.xlsx")
# print("\nExcel file created with multiple sheets containing all audit data!")

## 5. Visualize Audit Results

Let's create some visualizations to better understand the relationship between intermediate and final results.

In [19]:
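# Requires matplotlib and seaborn, which may not be installed in the kernel
# (e.g. `pip install matplotlib seaborn`)
import matplotlib.pyplot as plt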
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

ModuleNotFoundError: No module named 'matplotlib'

In [None]:
# Visualize cluster sizes across layers
fig, ax = plt.subplots(figsize=(10, 6))

layer_summary = create_layer_summary_df(topic_model)
x = layer_summary['layer']
y = layer_summary['num_clusters']

ax.bar(x, y)
ax.set_xlabel('Layer')
ax.set_ylabel('Number of Clusters')
ax.set_title('Number of Clusters per Layer')
ax.set_xticks(x)

# Add value labels on bars
for layer_idx, count in zip(x, y):
    ax.text(layer_idx, count + 0.5, str(count), ha='center')

plt.tight_layout()
plt.show()

In [None]:
# Analyze relationship between prompt length and topic name length
prompt_df = create_prompt_analysis_df(topic_model)

fig, ax = plt.subplots(figsize=(10, 6))
scatter = ax.scatter(prompt_df['prompt_length'], 
                    prompt_df['topic_name_length'],
                    c=prompt_df['layer'],
                    cmap='viridis',
                    alpha=0.6,
                    s=50)

ax.set_xlabel('Prompt Length (characters)')
ax.set_ylabel('Topic Name Length (characters)')
ax.set_title('Prompt Length vs Topic Name Length')

# Add colorbar
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Layer')

plt.tight_layout()
plt.show()

In [None]:
# Show distribution of cluster sizes
audit_df = create_audit_df(topic_model, layer_index=0)

fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(audit_df['num_documents'], bins=30, edgecolor='black')
ax.set_xlabel('Number of Documents')
ax.set_ylabel('Number of Clusters')
ax.set_title('Distribution of Cluster Sizes (Layer 0)')
ax.axvline(audit_df['num_documents'].mean(), color='red', linestyle='--', 
           label=f'Mean: {audit_df["num_documents"].mean():.1f}')
ax.legend()

plt.tight_layout()
plt.show()

## 6. Example Use Cases for Auditing

Here are some practical examples of how to use the audit functionality:

### 6.1 Find clusters where keyphrases don't match topic names

In [60]:
# Find potential mismatches
audit_df = create_audit_df(topic_model, layer_index=0)

# Check if any top keyphrases appear in the topic name
mismatches = []
for _, row in audit_df.iterrows():
    top_keyphrases = row['top_5_keyphrases'].lower().split(', ')
    topic_name = row['llm_topic_name'].lower()
    
    # Check if any keyphrase appears in topic name
    match_found = any(kp in topic_name for kp in top_keyphrases if kp)
    
    if not match_found and row['num_documents'] > 10:  # Only consider clusters with >10 docs
        mismatches.append({
            'cluster_id': row['cluster_id'],
            'keyphrases': row['top_5_keyphrases'],
            'topic_name': row['llm_topic_name'],
            'num_docs': row['num_documents']
        })

print(f"Found {len(mismatches)} clusters where top keyphrases don't appear in topic name:")
for m in mismatches[:5]:  # Show first 5
    print(f"\nCluster {m['cluster_id']} ({m['num_docs']} docs):")
    print(f"  Keyphrases: {m['keyphrases']}")
    print(f"  Topic name: {m['topic_name']}")

Found 38 clusters where top keyphrases don't appear in topic name:

Cluster 0 (25 docs):
  Keyphrases: espn bought, zones abc, hockey games, coverage sunday april, cbc
  Topic name: NHL Broadcast Coverage and ESPN Scheduling Discussions

Cluster 2 (25 docs):
  Keyphrases: win game, hawks power, blues, belfour, penalty
  Topic name: In-Depth Discussions on NHL Game Strategies and Player Performance Insights

Cluster 6 (12 docs):
  Keyphrases: mike keenan, ncaa division hockey, maine black bears, laughter senator mitchell, flyers
  Topic name: NCAA Hockey Discussions and Coaching Changes in Professional Teams

Cluster 7 (15 docs):
  Keyphrases: daigle, sharks, sather, kozlov, previous gm
  Topic name: NHL Trades, Strategies, and Player Prospects Discussions

Cluster 9 (13 docs):
  Keyphrases: fossil plants, radar detector detectors, cooling towers, hotwell pumps, condenser tubes
  Topic name: Engineering Challenges in Fossil Fuel Plants and Radar Detection Technologies
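
The check above requires a whole keyphrase to appear verbatim in the topic name, so cluster 0 is flagged even though "espn" and "coverage" occur in both. A minimal sketch of a looser, word-level comparison follows. `shared_tokens` is a hypothetical helper, and the minimum token length of 3 is an arbitrary choice to skip very short words.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shared_tokens(top_keyphrases, topic_name, min_len=3):
    """Return word tokens common to a comma-separated keyphrase string and a topic name."""
    keyphrase_tokens = set()
    for keyphrase in top_keyphrases.split(", "):
        keyphrase_tokens |= tokens(keyphrase)
    return {t for t in keyphrase_tokens & tokens(topic_name) if len(t) >= min_len}


# Cluster 0 from the output above: flagged by the substring check, but shares words
overlap_0 = shared_tokens(
    "espn bought, zones abc, hockey games, coverage sunday april, cbc",
    "NHL Broadcast Coverage and ESPN Scheduling Discussions",
)
print(sorted(overlap_0))  # ['coverage', 'espn']

# Cluster 7 from the output above: no shared words at all
overlap_7 = shared_tokens(
    "daigle, sharks, sather, kozlov, previous gm",
    "NHL Trades, Strategies, and Player Prospects Discussions",
)
print(sorted(overlap_7))  # []
```

Clusters with an empty overlap are stronger candidates for manual review than those flagged by the substring check alone. This sketch does no stemming, so "game" and "games" still count as different words.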


### 6.2 Find duplicate topic names

In [61]:
# Find duplicate topic names within each layer
for layer_idx in range(len(topic_model.cluster_layers_)):
    layer = topic_model.cluster_layers_[layer_idx]
    topic_counts = pd.Series(layer.topic_names).value_counts()
    duplicates = topic_counts[topic_counts > 1]
    
    if len(duplicates) > 0:
        print(f"\nLayer {layer_idx} - Duplicate topic names:")
        for topic, count in duplicates.items():
            print(f"  '{topic}' appears {count} times")
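
The loop above only looks for duplicates within a single layer. The comparison tables earlier suggest a name can also be carried over between layers, for example when a parent cluster keeps the name of its child. The sketch below finds names that occur in more than one layer, ignoring case and surrounding whitespace. `names_shared_across_layers` is a hypothetical helper and the sample names are illustrative.

```python
from collections import defaultdict


def names_shared_across_layers(names_by_layer):
    """Map each normalized topic name to the sorted layers it appears in (if more than one)."""
    layers_for_name = defaultdict(set)
    for layer_idx, names in names_by_layer.items():
        for name in names:
            layers_for_name[name.strip().lower()].add(layer_idx)
    return {
        name: sorted(layers)
        for name, layers in layers_for_name.items()
        if len(layers) > 1
    }


# Illustrative input: layer index -> list of topic names
names_by_layer = {
    0: ["Topic A", "Topic B"],
    1: ["topic a ", "Topic C"],
    2: ["Topic D"],
}

shared = names_shared_across_layers(names_by_layer)
print(shared)  # {'topic a': [0, 1]}
```

In this notebook the input could be built from the same attribute the loop above uses, for instance `{i: layer.topic_names for i, layer in enumerate(topic_model.cluster_layers_)}`.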

### 6.3 Analyze topic quality by cluster size

In [62]:
# Compare small vs large clusters
audit_df = create_audit_df(topic_model, layer_index=0)

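# Define small and large clusters
# Note: in this run the largest layer-0 cluster has 75 documents, so the >100 filter selects nothing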
small_clusters = audit_df[audit_df['num_documents'] < 20]
large_clusters = audit_df[audit_df['num_documents'] > 100]

print("Small clusters (<20 docs):")
print(f"  Count: {len(small_clusters)}")
print(f"  Avg keyphrases: {small_clusters['num_keyphrases'].mean():.1f}")
print(f"  Avg topic name length: {small_clusters['llm_topic_name'].str.len().mean():.1f}")

print("\nLarge clusters (>100 docs):")
print(f"  Count: {len(large_clusters)}")
print(f"  Avg keyphrases: {large_clusters['num_keyphrases'].mean():.1f}")
print(f"  Avg topic name length: {large_clusters['llm_topic_name'].str.len().mean():.1f}")

print("\nExample small cluster topics:")
for _, row in small_clusters.head(3).iterrows():
    print(f"  - {row['llm_topic_name']} ({row['num_documents']} docs)")

print("\nExample large cluster topics:")
for _, row in large_clusters.head(3).iterrows():
    print(f"  - {row['llm_topic_name']} ({row['num_documents']} docs)")

Small clusters (<20 docs):
  Count: 46
  Avg keyphrases: 16.0
  Avg topic name length: 63.6

Large clusters (>100 docs):
  Count: 0
  Avg keyphrases: nan
  Avg topic name length: nan

Example small cluster topics:
  - Comprehensive Analysis of NHL Game Outcomes, Standings, and Player Metrics (18 docs)
  - Discussion on Software Copy Protection and Circumvention Techniques (10 docs)
  - Discussions on Goalie Equipment and Masks in Hockey (10 docs)

Example large cluster topics:
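
The fixed `> 100` threshold selects no clusters here because the largest layer-0 cluster has 75 documents, which is why the averages print as `nan`. One option is to derive thresholds from the size distribution itself. The sketch below uses quartiles from the standard library; `size_thresholds` is a hypothetical helper and the sizes are illustrative values in the range seen in this run.

```python
from statistics import quantiles


def size_thresholds(sizes, n=4):
    """Return (low, high) cut points: the first and last of the n-quantile boundaries."""
    cuts = quantiles(sizes, n=n)
    return cuts[0], cuts[-1]


# Illustrative cluster sizes, not the full layer-0 distribution
sizes = [10, 10, 10, 12, 13, 15, 18, 25, 25, 75]

low, high = size_thresholds(sizes)
small = [s for s in sizes if s <= low]
large = [s for s in sizes if s >= high]

print(f"Small clusters (<= {low}): {len(small)}")
print(f"Large clusters (>= {high}): {len(large)}")
```

With the audit DataFrame, the same cut points could be computed from `audit_df['num_documents']`, so both groups stay non-empty regardless of the layer being inspected.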


## Summary

The audit functionality provides several ways to inspect and validate Toponymy results:

1. **`create_audit_df()`** - Get comprehensive audit data as a DataFrame
2. **`create_comparison_df()`** - Simple side-by-side comparison of intermediate vs final results
3. **`create_keyphrase_analysis_df()`** - Analyze how keyphrases relate to topic names
4. **`create_layer_summary_df()`** - High-level statistics for each layer
5. **`create_prompt_analysis_df()`** - Analyze prompt characteristics
6. **`get_cluster_details()`** - Get all details for a specific cluster
7. **`export_audit_excel()`** - Export everything to Excel for further analysis

This allows you to:
- Verify that topic names accurately reflect the extracted keyphrases and exemplars
- Identify potential issues or mismatches
- Understand how the LLM is interpreting the intermediate data
- Debug and improve your topic modeling pipeline

In [63]:
comparison_df = create_comparison_df(topic_model, layer_index=0)
print("\nComparison of Intermediate vs LLM Results (Layer 0, first 10 clusters):")
comparison_df.head(10)


Comparison of Intermediate vs LLM Results (Layer 0, first 10 clusters):


Unnamed: 0,Cluster ID,Document Count,Extracted Keyphrases (Top 5),Exemplar Count,Child Subtopics,Final LLM Topic Name
0,0,25,"espn bought, zones abc, hockey games, coverage...",8,,NHL Broadcast Coverage and ESPN Scheduling Dis...
1,1,18,"det, nhl, saves, 38, stl",8,,"Comprehensive Analysis of NHL Game Outcomes, S..."
2,2,25,"win game, hawks power, blues, belfour, penalty",8,,In-Depth Discussions on NHL Game Strategies an...
3,3,10,"copy protection schemes, pirates, progs, new e...",8,,Discussion on Software Copy Protection and Cir...
4,4,10,"goalies, tommy soderstrom, curtis joseph, havi...",8,,Discussions on Goalie Equipment and Masks in H...
5,5,10,"nhl teams, classy, numbers of euros, names lik...",8,,NHL European Player Representation and Managem...
6,6,12,"mike keenan, ncaa division hockey, maine black...",8,,NCAA Hockey Discussions and Coaching Changes i...
7,7,15,"daigle, sharks, sather, kozlov, previous gm",8,,"NHL Trades, Strategies, and Player Prospects D..."
8,8,10,"health insurance plan providing, drugs, lsd an...",8,,"Youth Drug Use, Health Insurance, and Poverty ..."
9,9,13,"fossil plants, radar detector detectors, cooli...",8,,Engineering Challenges in Fossil Fuel Plants a...


In [64]:
# Save the original text subset for testing new functionality
original_texts = text_subset.tolist()

## 7. Testing New Document Traceability Features

The audit functions can now trace each cluster back to its original documents. The cells below test these features:

### 7.1 Basic Document Indices (Always Included)

In [65]:
# Create audit DataFrame - document_indices are now always included
audit_with_indices = create_audit_df(topic_model, layer_index=0)

# Show document indices for each cluster
print("Document indices for each cluster (Layer 0):")
for idx, row in audit_with_indices.head(5).iterrows():
    print(f"\nCluster {row['cluster_id']}: {row['llm_topic_name']}")
    print(f"  Number of documents: {row['num_documents']}")
    print(f"  Document indices: {row['document_indices']}")
    
    # Show a few document texts from the indices
    print(f"  Sample documents:")
    for doc_idx in row['document_indices'][:2]:  # Show first 2 documents
        print(f"    [{doc_idx}] {original_texts[doc_idx][:80]}...")

Document indices for each cluster (Layer 0):

Cluster 0: NHL Broadcast Coverage and ESPN Scheduling Discussions
  Number of documents: 25
  Document indices: [7, 79, 103, 220, 649, 751, 765, 1086, 1118, 1239, 1358, 1435, 1649, 1701, 1718, 1799, 1934, 1998, 2212, 2508, 2565, 2642, 2755, 2811, 2935]
  Sample documents:
    [7] [stuff deleted]

Ok, here's the solution to your problem.  Move to Canada.  Yest...
    [79] Well I think whenever ESPN covers the game they do a wonderful job. But
   what ...

Cluster 1: Comprehensive Analysis of NHL Game Outcomes, Standings, and Player Metrics
  Number of documents: 18
  Document indices: [44, 76, 211, 370, 537, 609, 642, 794, 946, 950, 994, 1151, 1708, 2220, 2343, 2374, 2523, 2902]
  Sample documents:
    [44] Here are the NHL's alltime leaders in goals and points at the end of
the 1992-3 ...
    [76] Why not? I believe both the Devils and Islanders got 87 points.
Say for example,...

Cluster 2: In-Depth Discussions on NHL Game Strategies and P
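
The `document_indices` column maps clusters to documents. For traceability in the other direction (which cluster a given document landed in), a reverse lookup can be built from it. The sketch below is a minimal version using a plain dictionary; `build_doc_to_cluster` is a hypothetical helper, and the indices are a small excerpt of the output above.

```python
def build_doc_to_cluster(indices_by_cluster):
    """Invert a cluster_id -> document indices mapping into document index -> cluster_id."""
    doc_to_cluster = {}
    for cluster_id, doc_indices in indices_by_cluster.items():
        for doc_idx in doc_indices:
            doc_to_cluster[doc_idx] = cluster_id
    return doc_to_cluster


# Excerpt of the document indices printed above for clusters 0 and 1
indices_by_cluster = {
    0: [7, 79, 103],
    1: [44, 76, 211],
}

lookup = build_doc_to_cluster(indices_by_cluster)
print(lookup.get(79))  # 0
print(lookup.get(5))   # None: document 5 is not in any listed cluster
```

In the notebook, the input mapping could be built with `dict(zip(audit_with_indices['cluster_id'], audit_with_indices['document_indices']))`.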

### 7.2 Full Document Texts in Audit DataFrame

In [79]:
# Create audit with full document texts
audit_with_docs = create_audit_df(
    topic_model, 
    layer_index=1,
    include_all_docs=True,
    original_texts=original_texts
)

# Show the first cluster with all its documents
first_cluster = audit_with_docs.iloc[0]
print(f"Cluster 0: {first_cluster['llm_topic_name']}")
print(f"Keyphrases: {first_cluster['top_5_keyphrases']}")
print(f"\nAll {first_cluster['num_documents']} documents in this cluster:")

if 'document_texts' in first_cluster:
    for i, (idx, doc) in enumerate(zip(first_cluster['document_indices'], first_cluster['document_texts'])):
        print(f"\n[Document {idx}]:")
        print(doc[:200] + "..." if len(doc) > 200 else doc)
        if i >= 2:  # Show only first 3 documents
            print(f"\n... and {len(first_cluster['document_texts']) - 3} more documents")
            break

Cluster 0: NHL Game Strategies and Player Analysis
Keyphrases: hockey, game, team, playoffs, players

All 121 documents in this cluster:

[Document 0]:
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I...

[Document 8]:
Yeah, it's the second one.  And I believe that price too.  I've been trying
to get a good look at it on the Bruin-Sabre telecasts, and wow! does it ever
look good.  Whoever did that paint job knew wha...

[Document 24]:
I don't know the exact coverage in the states.  In Canada it is covered
by TSN, so maybe ESPN will grab their coverage!  I don't know!

As for the picks
Ottawa picks #1 which means it is almost 100% t...

... and 118 more documents


In [89]:
audit_with_docs

Unnamed: 0,layer,cluster_id,num_documents,top_5_keyphrases,all_keyphrases,num_keyphrases,num_exemplars,first_exemplar,subtopics_list,subtopics_text,prompt_preview,prompt_length,llm_topic_name,document_indices,document_texts
0,1,0,121,"hockey, game, team, playoffs, players","[hockey, game, team, playoffs, players, leafs,...",16,8,NHL RESULTS FOR GAMES PLAYED 4/15/93.\n\n-----...,"[Comprehensive Analysis of NHL Game Outcomes, ...","Comprehensive Analysis of NHL Game Outcomes, S...",{'system': '\nYou are an expert at classifying...,19817,NHL Game Strategies and Player Analysis,"[0, 8, 24, 44, 66, 76, 96, 130, 174, 211, 228,...",[I am sure some bashers of Pens fans are prett...
1,1,1,45,"x-soviet armenian government, turkish, muslim ...","[x-soviet armenian government, turkish, muslim...",16,8,Turkish Historical Revision <9305111942@zuma.U...,[Armenian-Turkish Conflict and Historical Revi...,Armenian-Turkish Conflict and Historical Revis...,[!SKIP!]: Armenian-Turkish Conflict and Histor...,78,Armenian-Turkish Conflict and Historical Revis...,"[2, 15, 64, 93, 112, 113, 182, 188, 265, 304, ...",[Finally you said what you dream about. Medite...
2,1,2,126,"patients, chronic, sci med, msg, intellect and...","[patients, chronic, sci med, msg, intellect an...",16,8,"Hate to wreck your elaborate theory, but Steve...",[Health Effects of MSG and Glutamate in Food C...,Health Effects of MSG and Glutamate in Food Co...,{'system': '\nYou are an expert at classifying...,9422,Chronic Health Issues and Dietary Effects,"[34, 164, 190, 233, 253, 263, 269, 303, 337, 3...","[According to a previous poster, one should se..."
3,1,3,47,"braves, baseball, catcher, hitter, pinch hit","[braves, baseball, catcher, hitter, pinch hit,...",16,8,HEY!!! All you Yankee fans who've been knockin...,[In-Depth Analysis of Phillies and Mariners Pl...,In-Depth Analysis of Phillies and Mariners Pla...,[!SKIP!]: In-Depth Analysis of Phillies and Ma...,78,In-Depth Analysis of Phillies and Mariners Pla...,"[33, 144, 171, 318, 371, 378, 458, 470, 802, 9...",[Be patient. He has a sore shoulder from crash...
4,1,4,37,"stats, total baseball, alomar, career, swing","[stats, total baseball, alomar, career, swing,...",16,8,"You're right: Thomas, Gonzalez, Sheffield, and...",[Comprehensive Evaluation of MLB Player Metric...,Comprehensive Evaluation of MLB Player Metrics...,{'system': '\nYou are an expert at classifying...,8872,In-depth Analysis of Baseball Player Statistics,"[60, 90, 116, 225, 379, 395, 497, 509, 557, 64...",[I don't buy this at all. I think things are ...
5,1,5,146,"encryption scheme, clipper chip, public key, k...","[encryption scheme, clipper chip, public key, ...",16,8,^^^^^^^^^^^^^^^^^^^\\n ...,[Mailing List Management and FAQ Discussions i...,Mailing List Management and FAQ Discussions in...,{'system': '\nYou are an expert at classifying...,22347,Encryption Controversies and Government Survei...,"[48, 53, 73, 133, 140, 141, 169, 215, 217, 224...",[You're blowing smoke. Qualcomm wants to sell...
6,1,6,40,"pulse dialing, box and faceplate screw, ground...","[pulse dialing, box and faceplate screw, groun...",16,8,"Hey Serdar,\n What nationality are y...",[Telecommunications Equipment and Dialing Tech...,Telecommunications Equipment and Dialing Techn...,[!SKIP!]: Telecommunications Equipment and Dia...,73,Telecommunications Equipment and Dialing Techn...,"[6, 49, 78, 88, 98, 198, 285, 334, 524, 568, 6...",[AE is in Dallas...try 214/241-6060 or 214/241...
7,1,7,78,"state of israel, arabs, palestinians, jews, ce...","[state of israel, arabs, palestinians, jews, c...",16,8,Elias Davidsson writes...\n \n \nED> The follo...,[Israeli-Palestinian Conflict and Historical Q...,Israeli-Palestinian Conflict and Historical Qu...,{'system': '\nYou are an expert at classifying...,22035,Israeli-Palestinian Conflict and Holocaust Dis...,"[102, 153, 274, 332, 339, 415, 510, 559, 567, ...",[In the same way in which antisemite means ant...
8,1,8,36,"orbit, spacecraft, solar, jupiter, pluto","[orbit, spacecraft, solar, jupiter, pluto, pro...",16,8,Archive-name: space/new_probes\nLast-modified:...,[Planetary Exploration Missions and Spacecraft...,Planetary Exploration Missions and Spacecraft ...,[!SKIP!]: Planetary Exploration Missions and S...,68,Planetary Exploration Missions and Spacecraft ...,"[14, 40, 210, 234, 296, 324, 479, 489, 698, 78...","[There is no notion of heliocentric, or even g..."
9,1,9,48,"space station redesign, moon, space shuttle, s...","[space station redesign, moon, space shuttle, ...",16,8,"In the April edition of ""One Small Step for a ...",[NASA Space Station Redesign and Lunar Mission...,NASA Space Station Redesign and Lunar Mission ...,{'system': '\nYou are an expert at classifying...,18020,NASA Space Station and Astronomy Challenges,"[110, 154, 311, 322, 375, 437, 506, 514, 581, ...","[""Space Station Redesign Leader Says Cost Goal..."


In [68]:
len(audit_with_docs['document_texts'][0])

25

### 7.3 Limited Documents per Cluster

In [69]:
# Create audit with limited documents (useful for large datasets)
audit_limited = create_audit_df(
    topic_model,
    layer_index=0,
    include_all_docs=True,
    max_docs_per_cluster=3,  # Only show first 3 documents
    original_texts=original_texts
)

print("Audit with limited documents per cluster:")
for idx, row in audit_limited.head(3).iterrows():
    print(f"\nCluster {row['cluster_id']}: {row['llm_topic_name']}")
    print(f"  Total documents in cluster: {row.get('total_docs_in_cluster', row['num_documents'])}")
    
    if 'document_sample' in row:
        print(f"  Showing first {len(row['document_sample'])} documents:")
        for i, doc in enumerate(row['document_sample']):
            print(f"    [{row['document_indices'][i]}] {doc[:100]}...")
    elif 'document_texts' in row:
        # If cluster has 3 or fewer docs, they'll be in document_texts
        print(f"  All {len(row['document_texts'])} documents:")
        for i, doc in enumerate(row['document_texts']):
            print(f"    [{row['document_indices'][i]}] {doc[:100]}...")

Audit with limited documents per cluster:

Cluster 0: NHL Broadcast Coverage and ESPN Scheduling Discussions
  Total documents in cluster: 25
  Showing first 3 documents:
    [7] [stuff deleted]

Ok, here's the solution to your problem.  Move to Canada.  Yesterday I was able
to ...
    [79] Well I think whenever ESPN covers the game they do a wonderful job. But
   what I don't understand i...
    [103] I haven't heard any news about ASN carrying any games but the local
cable station here in St. John's...

Cluster 1: Comprehensive Analysis of NHL Game Outcomes, Standings, and Player Metrics
  Total documents in cluster: 18
  Showing first 3 documents:
    [44] Here are the NHL's alltime leaders in goals and points at the end of
the 1992-3 season. Again, much ...
    [76] Why not? I believe both the Devils and Islanders got 87 points.
Say for example, another team had th...
    [211] [Much text deleted]

:   plus/minus ... it is the most misleading hockey stat available.

Not necess...



### 7.4 Using the get_cluster_documents Helper Function

In [70]:
# Import the new helper function
from toponymy.audit import get_cluster_documents

# Get all documents for a specific cluster
cluster_id = 2  # Choose a cluster to examine
cluster_docs = get_cluster_documents(
    topic_model, 
    layer_index=0, 
    cluster_id=cluster_id, 
    original_texts=original_texts
)

print(f"Cluster {cluster_id}: {topic_model.cluster_layers_[0].topic_names[cluster_id]}")
print(f"Total documents: {cluster_docs['total_count']}")
print(f"\nDocument indices: {cluster_docs['indices']}")
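print("\nFirst 3 documents:")
# Use doc_text as the loop variable so the dataset-level `text` array is not overwritten
for idx, doc_text in zip(cluster_docs['indices'][:3], cluster_docs['texts'][:3]):
    print(f"\n[Document {idx}]:")
    print((doc_text[:200] + "...") if len(doc_text) > 200 else doc_text)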

Cluster 2: In-Depth Discussions on NHL Game Strategies and Player Performance Insights
Total documents: 25

Document indices: [0, 174, 574, 718, 850, 933, 972, 1149, 1254, 1263, 1577, 1606, 1620, 1689, 1749, 1813, 1842, 1974, 2083, 2206, 2607, 2613, 2768, 2826, 2989]

First 3 documents:

[Document 0]:
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I...

[Document 174]:
Dear Ulf,

	Would you possibly consider helpiMontreal Canadiens fans everywhere
by throwing a knee-check in the direction of Denis Savard during your upcoming
game against Montreal? We just can't seem...

[Document 574]:
J<--> 
J<--> Yahooooooooooooooooooooo!
J<--> 
J<--> What a game, we finally beat those diques...and in O.T.!
J<--> The Habs dominated this game and especially in O.T..

You realize that we dominated g...


### 7.5 Tracing Keyphrases Back to Original Documents

In [71]:
# Trace keyphrases back to their source documents
layer = topic_model.cluster_layers_[0]

# Pick a cluster to analyze
cluster_id = 0
print(f"Analyzing Cluster {cluster_id}: {layer.topic_names[cluster_id]}")

# Get keyphrases and documents for this cluster
keyphrases = layer.keyphrases[cluster_id][:5]  # Top 5 keyphrases
cluster_docs = get_cluster_documents(topic_model, 0, cluster_id, original_texts)

print(f"\nTop 5 keyphrases: {', '.join(keyphrases)}")
print(f"Checking these keyphrases in {cluster_docs['total_count']} cluster documents:")

# For each keyphrase, find which documents contain it
for kp in keyphrases[:3]:  # Check top 3 keyphrases
    print(f"\n'{kp}' appears in:")
    occurrences = []
    
    for idx, doc in zip(cluster_docs['indices'], cluster_docs['texts']):
        if kp.lower() in doc.lower():
            # Find the first match and show it with 30 characters of context
            pos = doc.lower().find(kp.lower())
            match_end = pos + len(kp)
            start = max(0, pos - 30)
            end = min(len(doc), match_end + 30)
            
            # Highlight the matched text once, preserving its original casing
            # (chained str.replace calls can wrap the same match twice)
            context = f"{doc[start:pos]}**{doc[pos:match_end]}**{doc[match_end:end]}"
            
            occurrences.append((idx, context))
    
    if occurrences:
        for idx, context in occurrences[:2]:  # Show first 2 occurrences
            print(f"  Doc [{idx}]: ...{context}...")
        if len(occurrences) > 2:
            print(f"  ... and {len(occurrences) - 2} more occurrences")
    else:
        print("  Not found as exact match in any document")

Analyzing Cluster 0: NHL Broadcast Coverage and ESPN Scheduling Discussions

Top 5 keyphrases: espn bought, zones abc, hockey games, coverage sunday april, cbc
Checking these keyphrases in 25 cluster documents:

'espn bought' appears in:
  Doc [1435]: ...all of you
who are unaware -> ESPN bought the air time from ABC and did...
  Doc [2642]: ... the entire playoffs).  Since ESPN bought the
SCA contract, there are l...

'zones abc' appears in:
  Not found as exact match in any document

'hockey games' appears in:
  Doc [751]: ...eninsula that will be showing ****hockey games****.  I'm looking for something 
...
  Doc [1799]: ...n decided not
to televise the ****hockey games****.  La directrous de programme ...
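
The spot check above prints only the first two matches per keyphrase. A complementary view is the share of cluster documents that contain each keyphrase at all. The following is a minimal sketch in plain Python, not part of `toponymy.audit`: `keyphrase_coverage` is a hypothetical helper, and the toy strings stand in for `cluster_docs['texts']`.

```python
from typing import Dict, Sequence


def keyphrase_coverage(
    keyphrases: Sequence[str], docs: Sequence[str]
) -> Dict[str, float]:
    """Return the fraction of docs containing each keyphrase (case-insensitive substring match)."""
    if not docs:
        return {kp: 0.0 for kp in keyphrases}
    lowered = [doc.lower() for doc in docs]
    return {
        kp: sum(kp.lower() in doc for doc in lowered) / len(lowered)
        for kp in keyphrases
    }


# Toy documents standing in for cluster_docs['texts'] (illustrative only)
sample_docs = [
    "ESPN bought the air time from ABC.",
    "Will they televise the hockey games on Sunday?",
    "Hockey games are on CBC tonight.",
    "No coverage in my area.",
]
coverage = keyphrase_coverage(
    ["espn bought", "hockey games", "zones abc"], sample_docs
)
for kp, frac in coverage.items():
    print(f"{kp!r}: {frac:.0%} of documents")
```

A keyphrase with zero coverage, like `'zones abc'` in the output above, never appears verbatim in the cluster and may be worth a closer look.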


### 7.6 Comparing Exemplars with All Cluster Documents

In [72]:
# Compare exemplar documents with all documents in a cluster
cluster_id = 1
layer = topic_model.cluster_layers_[0]

# Get cluster details including exemplar indices
cluster_details = get_cluster_details(topic_model, 0, cluster_id)
exemplar_indices = cluster_details.get('exemplar_indices', [])

# Get all documents in the cluster
all_docs = get_cluster_documents(topic_model, 0, cluster_id, original_texts)

print(f"Cluster {cluster_id}: {layer.topic_names[cluster_id]}")
print(f"Total documents: {all_docs['total_count']}")
print(f"Number of exemplars: {len(exemplar_indices)}")
print(f"Exemplar indices: {exemplar_indices}")

# Show which documents are exemplars
print("\nExemplar documents (most representative):")
for i, ex_idx in enumerate(exemplar_indices[:3]):
    if ex_idx in all_docs['indices']:
        doc_position = all_docs['indices'].index(ex_idx)
        print(f"\n[Exemplar {i+1} - Document {ex_idx}]:")
        print(all_docs['texts'][doc_position][:150] + "...")

# Show non-exemplar documents
non_exemplar_indices = [idx for idx in all_docs['indices'] if idx not in exemplar_indices]
print(f"\n\nNon-exemplar documents ({len(non_exemplar_indices)} total):")
for idx in non_exemplar_indices[:2]:
    doc_position = all_docs['indices'].index(idx)
    print(f"\n[Document {idx}]:")
    print(all_docs['texts'][doc_position][:150] + "...")

Cluster 1: Comprehensive Analysis of NHL Game Outcomes, Standings, and Player Metrics
Total documents: 18
Number of exemplars: 8
Exemplar indices: [994, 642, 76, 794, 609, 537, 211, 2343]

Exemplar documents (most representative):

[Exemplar 1 - Document 994]:
NHL RESULTS FOR GAMES PLAYED 4/15/93.

--------------------------------------------------------------------------------
                              ...

[Exemplar 2 - Document 642]:
Well tell us about your pool table!

-=- Andy -=-...

[Exemplar 3 - Document 76]:
Why not? I believe both the Devils and Islanders got 87 points.
Say for example, another team had this record : 20-37-47;
they had 20*2+47*1+37*0=87 w...


Non-exemplar documents (10 total):

[Document 44]:
Here are the NHL's alltime leaders in goals and points at the end of
the 1992-3 season. Again, much thanks to Joseph Achkar.

Carl

Notes: An active p...

[Document 370]:
NHL RESULTS FOR GAMES PLAYED 4/05/93.

--------------------------------------------------------
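
The cell above tests membership with `idx not in exemplar_indices` on a list, which is fine for a few dozen documents but scans the list on every check. A set-based split is sketched below under the assumption that both inputs are plain sequences of integer document indices; `split_by_exemplar` is a hypothetical helper, not a Toponymy function, and the indices are toy values. It also reports exemplar indices that are absent from the cluster, a case the loop above skips silently.

```python
from typing import List, Sequence, Tuple


def split_by_exemplar(
    cluster_indices: Sequence[int], exemplar_indices: Sequence[int]
) -> Tuple[List[int], List[int], List[int]]:
    """Split cluster documents into exemplars and non-exemplars.

    Also returns exemplar indices that do not occur in the cluster at all.
    """
    exemplar_set = set(exemplar_indices)
    cluster_set = set(cluster_indices)
    exemplars = [i for i in cluster_indices if i in exemplar_set]
    others = [i for i in cluster_indices if i not in exemplar_set]
    missing = [i for i in exemplar_indices if i not in cluster_set]
    return exemplars, others, missing


# Toy indices; 9999 is a deliberately invalid exemplar index
cluster = [44, 76, 211, 370, 537]
exemplar_ids = [537, 76, 9999]
exemplars, others, missing = split_by_exemplar(cluster, exemplar_ids)
print(f"Exemplars: {exemplars}")
print(f"Non-exemplars: {others}")
print(f"Exemplars missing from cluster: {missing}")
```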

### 7.7 Analyzing Document Distribution Across Clusters

In [73]:
# Create a DataFrame showing document coverage across clusters
audit_df = create_audit_df(topic_model, layer_index=0)

# Build a document-to-cluster mapping
doc_cluster_map = {}
for _, row in audit_df.iterrows():
    cluster_id = row['cluster_id']
    topic_name = row['llm_topic_name']
    for doc_idx in row['document_indices']:
        doc_cluster_map[doc_idx] = {
            'cluster_id': cluster_id,
            'topic_name': topic_name
        }

# Find any unclustered documents
all_doc_indices = set(range(len(original_texts)))
clustered_indices = set(doc_cluster_map.keys())
unclustered_indices = all_doc_indices - clustered_indices

print("Document distribution analysis:")
print(f"Total documents: {len(original_texts)}")
print(f"Clustered documents: {len(clustered_indices)}")
print(f"Unclustered documents: {len(unclustered_indices)}")

if unclustered_indices:
    # Sort once and reuse; sorted() accepts a set directly
    unclustered_sorted = sorted(unclustered_indices)
    print(f"\nUnclustered document indices: {unclustered_sorted[:10]}")
    print("\nSample unclustered documents:")
    for idx in unclustered_sorted[:3]:
        print(f"\n[Document {idx}]:")
        print(original_texts[idx][:150] + "...")

# Show document coverage by cluster
print("\n\nDocument coverage by cluster:")
for _, row in audit_df.iterrows():
    percentage = (row['num_documents'] / len(original_texts)) * 100
    print(f"Cluster {row['cluster_id']}: {row['num_documents']} docs ({percentage:.1f}%) - {row['llm_topic_name'][:50]}...")

Document distribution analysis:
Total documents: 3000
Clustered documents: 1754
Unclustered documents: 1246

Unclustered document indices: [3, 5, 9, 11, 12, 13, 17, 18, 19, 22]

Sample unclustered documents:

[Document 3]:
Think!

It's the SCSI card doing the DMA transfers NOT the disks...

The SCSI card can do DMA transfers containing data from any of the SCSI devices
i...

[Document 5]:
Back in high school I worked as a lab assistant for a bunch of experimental
psychologists at Bell Labs.  When they were doing visual perception and
me...

[Document 9]:
If a Christian means someone who believes in the divinity of Jesus, it is safe
to say that Jesus was a Christian.
--
"On the first day after Christmas...


Document coverage by cluster:
Cluster 0: 25 docs (0.8%) - NHL Broadcast Coverage and ESPN Scheduling Discuss...
Cluster 1: 18 docs (0.6%) - Comprehensive Analysis of NHL Game Outcomes, Stand...
Cluster 2: 25 docs (0.8%) - In-Depth Discussions on NHL Game Strategies and Pl...
... (output truncated)
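
The mapping loop in this cell overwrites an entry if a document index appears under more than one cluster, so overlaps would go unnoticed. The sketch below inverts a cluster-to-documents mapping while recording such overlaps. It assumes the input is a plain dict of `cluster_id -> document indices`, which could be built from the `cluster_id` and `document_indices` columns; `map_docs_to_clusters` is a hypothetical name and the data is a toy example.

```python
from typing import Dict, Iterable, List, Set, Tuple


def map_docs_to_clusters(
    cluster_to_docs: Dict[int, Iterable[int]], n_docs: int
) -> Tuple[Dict[int, int], List[int], Set[int]]:
    """Invert cluster -> documents into document -> cluster.

    Returns the inverted mapping (last cluster wins on overlap), the sorted
    list of unclustered document indices, and the set of indices that were
    listed under more than one cluster.
    """
    doc_to_cluster: Dict[int, int] = {}
    duplicates: Set[int] = set()
    for cluster_id, doc_indices in cluster_to_docs.items():
        for doc_idx in doc_indices:
            if doc_idx in doc_to_cluster:
                duplicates.add(doc_idx)
            doc_to_cluster[doc_idx] = cluster_id
    unclustered = [i for i in range(n_docs) if i not in doc_to_cluster]
    return doc_to_cluster, unclustered, duplicates


# Toy data: document 2 is listed under both clusters
toy = {0: [0, 2], 1: [4, 2]}
doc_to_cluster, unclustered, duplicates = map_docs_to_clusters(toy, n_docs=6)
print(f"Clustered: {len(doc_to_cluster)}, unclustered: {unclustered}")
print(f"Documents in more than one cluster: {sorted(duplicates)}")
```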

## Summary of New Document Traceability Features

The audit functions provide four ways to trace cluster results back to the original documents:

1. **`document_indices`** - Always included in audit DataFrames, showing which documents belong to each cluster

2. **`include_all_docs=True`** - Adds full document texts to the audit DataFrame:
   - `document_texts` column contains all document texts for each cluster
   - Requires passing `original_texts` parameter

3. **`max_docs_per_cluster`** - Limits the number of documents shown per cluster:
   - Creates `document_sample` column with limited documents
   - Adds `total_docs_in_cluster` to show the actual count

4. **`get_cluster_documents()`** - Helper function to easily retrieve all documents for a specific cluster:
   - Returns indices, texts, and total count
   - Supports `max_docs` parameter for limiting results

These features enable:
- Validating that clusters contain semantically related documents
- Tracing how keyphrases were extracted from actual document content
- Understanding why certain documents were grouped together
- Debugging clustering quality issues
- Creating detailed audit reports with full traceability
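
To make the relationship between a limited document sample and the full cluster size concrete, here is a standalone sketch. It is not Toponymy's implementation: `limit_documents` and the `truncated` flag are hypothetical, and only the `indices`, `texts`, and `total_count` keys mirror what the notebook reads from `get_cluster_documents()`.

```python
from typing import Any, Dict, Optional, Sequence


def limit_documents(
    indices: Sequence[int],
    texts: Sequence[str],
    max_docs: Optional[int] = None,
) -> Dict[str, Any]:
    """Keep at most max_docs documents while still reporting the full count."""
    if len(indices) != len(texts):
        raise ValueError("indices and texts must have the same length")
    if max_docs is not None and max_docs < 0:
        raise ValueError("max_docs must be non-negative")
    total = len(indices)
    keep = total if max_docs is None else min(max_docs, total)
    return {
        "indices": list(indices[:keep]),
        "texts": list(texts[:keep]),
        "total_count": total,      # size of the whole cluster
        "truncated": keep < total,  # hypothetical convenience flag
    }


# Toy cluster with four documents, limited to three
result = limit_documents([7, 79, 103, 120], ["a", "b", "c", "d"], max_docs=3)
print(f"Showing {len(result['texts'])} of {result['total_count']} documents")
```

Reporting the total alongside the sample keeps a truncated view from being mistaken for the whole cluster.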