# Topic Modelling & Theme Discovery

## 1. Necessary Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from transformers import pipeline

import sys
sys.path.append('../')
import config
from utils import load_processed_data, save_processed_data, print_section_header

## 2. Data Loading

In [None]:
# Load cleaned data from previous notebook
df = load_processed_data('cleaned_data.csv')

print(f"Loaded {len(df)} processed responses")
print(f"Comments with text: {df['cleaned_comment'].notna().sum()}")
print(f"Empty comments: {df['cleaned_comment'].isna().sum()}")

print("\nGroup distribution:")
print(df['group'].value_counts())

Loaded processed data: ../data/processed/cleaned_data.csv
Loaded 582 processed responses
Comments with text: 422
Empty comments: 160

Group distribution:
group
Non-VOLT_control    177
Non-VOLT_pilot      164
VOLT_control        158
VOLT_pilot           83
Name: count, dtype: int64


## 3. Preparing Data for Topic Modelling

In [3]:
def prepare_text_for_topics(df):
    """Prepare text data for topic modelling"""
    # Filter valid comments
    documents = df[df['cleaned_comment'].notna()]['cleaned_comment'].tolist()
    
    print(f"Prepared {len(documents)} documents for topic modeling")
    print(f"Average document length: {np.mean([len(doc.split()) for doc in documents]):.1f} words")
    
    return documents

# Prepare documents
documents = prepare_text_for_topics(df)

print("\nSample documents:")
for i, doc in enumerate(documents[:3], 1):
    print(f"{i}: {doc}")

Prepared 422 documents for topic modeling
Average document length: 11.0 words

Sample documents:
1: good package
2: good customer service
3: far good charlie efficient helpful let hope continues confident


## 4. Topic Modelling

I'm using LDA with 3 different vectorisation techniques: 

* **Vanilla tf-idf** with 1000 max features, min_df = 2 (a word must appear in at least 2 documents) and max_df = 0.8 (words appearing in more than 80% of the documents are removed).

* **tf-idf with n-grams of up to 3 words** to capture phrases for better contextual analysis

* **Count Vectorizer with 3-word n-grams** to benchmark results against the n-gram tf-idf.
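To make the min_df and max_df settings above concrete, here is a minimal pure-Python sketch of document-frequency pruning. It is not scikit-learn's implementation; `prune_vocabulary` and `toy_docs` are hypothetical names and data made up for illustration. It assumes, as I understand scikit-learn's behaviour, that an integer min_df is an absolute document count and a float max_df is a proportion of documents.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms that appear in at least `min_df` documents and in at most
    `max_df` (a proportion) of all documents. Illustrative helper only."""
    n_docs = len(docs)
    # Count each term once per document: document frequency, not term frequency
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    max_count = max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Hypothetical toy corpus in the style of the cleaned comments
toy_docs = [
    "good service helpful staff",
    "good service quick install",
    "good price helpful agent",
    "good service friendly staff",
    "good broadband",
]

vocab = prune_vocabulary(toy_docs)
print(vocab)  # "good" is in every document, so max_df drops it; rare words fail min_df
```

With so few documents and an average length of about 11 words, min_df = 2 removes a lot of the vocabulary, which is worth keeping in mind when reading the topic words.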

Then I'm using BERTopic with:

* MiniLM - all-MiniLM-L6-v2: A smaller and faster embedding for baseline 

* MPNet - all-mpnet-base-v2: A high-quality larger model, perhaps the winner?

* DistilBERT - distilbert-base-nli: An older model that has already been outperformed by MiniLM and MPNet on STS benchmarks, included for legacy's sake.

* RoBERTa - roberta-base-nli: Good semantic understanding

LDA vs BERTopic:
- LDA: Traditional, interpretable, proven for business analysis
- BERTopic: Modern, uses neural embeddings, better semantic understanding

In [4]:
def run_lda_analysis_enhanced(documents, n_topics=8):
    """Enhanced LDA with multiple vectorization approaches"""
    print(f"Running Enhanced LDA with {n_topics} topics...")
    
    # Test multiple vectorizers
    vectorizers = {
        'tfidf_basic': TfidfVectorizer(
            max_features=1000, min_df=2, max_df=0.8, stop_words='english'
        ),
        'tfidf_ngrams': TfidfVectorizer(
            max_features=1500, min_df=2, max_df=0.8, 
            ngram_range=(1, 3), stop_words='english'
        ),
        'count_ngrams': CountVectorizer(
            max_features=1000, min_df=2, max_df=0.8,
            ngram_range=(1, 3), stop_words='english'
        )
    }
    
    results = {}
    
    for vec_name, vectorizer in vectorizers.items():
        print(f"\n--- Using {vec_name} vectorizer ---")
        
        # Transform documents
        doc_term_matrix = vectorizer.fit_transform(documents)
        
        # Fit LDA model
        lda_model = LatentDirichletAllocation(
            n_components=n_topics,
            random_state=config.RANDOM_STATE,
            max_iter=50,
            learning_method='batch'
        )
        
        lda_model.fit(doc_term_matrix)
        
        # Get document-topic probabilities
        doc_topics = lda_model.transform(doc_term_matrix)
        
        # Extract topic words
        feature_names = vectorizer.get_feature_names_out()
        topic_words = []
        
        for topic_idx, topic in enumerate(lda_model.components_):
            top_words_idx = topic.argsort()[-10:][::-1]
            top_words = [feature_names[i] for i in top_words_idx]
            topic_words.append(top_words)
            print(f"  Topic {topic_idx}: {', '.join(top_words[:5])}")
        
        results[vec_name] = {
            'model': lda_model,
            'vectorizer': vectorizer,
            'doc_topics': doc_topics,
            'topic_words': topic_words
        }
    
    return results

# Run enhanced LDA analysis
lda_results = run_lda_analysis_enhanced(documents, config.N_TOPICS)

Running Enhanced LDA with 8 topics...

--- Using tfidf_basic vectorizer ---
  Topic 0: helpful, explained, extremely, patient, polite
  Topic 1: efficient, phone, good, helpful, communication
  Topic 2: good, service, price, great, informative
  Topic 3: customer, service, good, fantastic, experience
  Topic 4: excellent, service, customer, company, helpful
  Topic 5: helpful, friendly, really, staff, help
  Topic 6: quick, easy, set, service, friendly
  Topic 7: nice, fast, staff, services, set

--- Using tfidf_ngrams vectorizer ---
  Topic 0: excellent, excellent service, installation, phone, service
  Topic 1: customer, customer service, service, friendly, helpful
  Topic 2: great, helpful, extremely helpful, extremely, service
  Topic 3: good, service, good service, price, good price
  Topic 4: explained, company, helpful, work, services
  Topic 5: days, good, really, prompt, helpful
  Topic 6: helpful, patient, friendly, polite, contract
  Topic 7: efficient, friendly, nice, help,

In [5]:
def run_bertopic_analysis_enhanced(documents, n_topics=8):
    """Enhanced BERTopic with different embedding models"""
    print(f"Running Enhanced BERTopic with {n_topics} topics...")
    
    # Test different embedding models
    embedding_models = {
        'all-MiniLM-L6-v2': 'all-MiniLM-L6-v2',  
        'all-mpnet-base-v2': 'all-mpnet-base-v2', 
        'distilbert-base-nli': 'distilbert-base-nli-mean-tokens',  
        'roberta-base-nli': 'roberta-base-nli-stsb-mean-tokens' 
    }
    
    results = {}
    
    for model_name, model_path in embedding_models.items():
        print(f"\n--- Using {model_name} embeddings ---")
        
        # Create embedding model
        embedding_model = SentenceTransformer(model_path)
        
        # Create BERTopic model
        topic_model = BERTopic(
            embedding_model=embedding_model,
            nr_topics=n_topics,
            min_topic_size=config.MIN_TOPIC_SIZE,
            calculate_probabilities=True,
            verbose=False
        )
        
        # Fit model
        topics, probabilities = topic_model.fit_transform(documents)
        topic_info = topic_model.get_topic_info()
        
        unique_topics = len([t for t in set(topics) if t != -1])
        print(f"  Topics found: {unique_topics}")
        
        for i in range(unique_topics):
            if i in topic_model.get_topics():
                topic_words = topic_model.get_topic(i)
                if topic_words:
                    words = [word for word, score in topic_words[:5]]
                    print(f"    Topic {i}: {', '.join(words)}")
        
        results[model_name] = {
            'model': topic_model,
            'topics': topics,
            'probabilities': probabilities,
            'topic_info': topic_info
        }
    
    return results

# Run enhanced BERTopic analysis
bertopic_results = run_bertopic_analysis_enhanced(documents, config.N_TOPICS)

Running Enhanced BERTopic with 8 topics...

--- Using all-MiniLM-L6-v2 embeddings ---


  Topics found: 2
    Topic 0: service, good, excellent, happy, great
    Topic 1: service, helpful, company, customer, good

--- Using all-mpnet-base-v2 embeddings ---
  Topics found: 7
    Topic 0: service, customer, good, excellent, great
    Topic 1: installation, engineer, install, came, day
    Topic 2: friendly, helpful, lady, spoke, polite
    Topic 3: call, company, different, said, change
    Topic 4: broadband, company, charge, connection, could
    Topic 5: mark, deal, helpful, provided, know
    Topic 6: helpful, informative, really, informant, everyone

--- Using distilbert-base-nli embeddings ---
  Topics found: 7
    Topic 0: company, would, call, media, told
    Topic 1: helpful, friendly, polite, staff, easy
    Topic 2: service, good, price, great, excellent
    Topic 3: everything, explained, really, polite, professional
    Topic 4: quick, fast, service, set, easy
    Topic 5: broadband, internet, skills, helpful, phone
    Topic 6: customer, service, excellent, gr

## 5. Model Evaluation

LDA and BERTopic are fundamentally different, so there is no single metric that compares their performance directly. LDA is a Bayesian probabilistic model that gives every document a probability for every topic, whereas BERTopic is clustering-based and hard-assigns each document to one topic (or to the outlier group). The idea behind using 2 different modelling approaches was to give a holistic view of the topic modelling problem. Due to time constraints, I couldn't experiment much with the models, so I devised a composite scoring criterion with 3 components for each of LDA and BERTopic. For LDA models these are the 3 metrics:

1. **Interpretability Score (40% weight)**
    - Unique words / Total words across topics, higher = better
    - For example: Topic 1: [service, good, helpful, staff, quick], Topic 2: [service, good, staff, helpful, friendly]; there are 6 unique words out of 10 words in total, so interpretability = 6/10. That is low: 4 words overlap, so the two topics are not well separated.

2. **Focus Score (30% weight)**
    - Average maximum topic probability per document, the higher the better.
    - It gives us 2 broader focus categories, high focus (clearly belonging to one topic) and low focus (unclear themes)
    - For example: Document 1 has topic probabilities [0.8, 0.1, 0.1] across 3 topics; its maximum, 0.8, indicates high focus. Document 2 has [0.4, 0.3, 0.3]; its maximum, 0.4, indicates low focus and unclear themes. The focus score is the average of these maxima over all documents.

3. **Business Relevance (30% weight)**
    - % of top words relevant for business. In this case, some keywords like 'helpful', 'friendly', 'quick' etc. Higher = better

So, the composite score for LDA models will be 0.4*(Interpretability) + 0.3*(Focus Score) + 0.3*(Business Relevance)
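As a worked example of the three LDA metrics and the composite, here is a minimal sketch on toy inputs. The function names and the `doc_topic_probs` values are made up for illustration; the topic word lists are the ones from the interpretability example, and the keyword set is the one used in the evaluation code.

```python
def interpretability(topics):
    """Unique top words divided by total top words across all topics."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words) if words else 0.0


def focus(doc_topic_probs):
    """Average of each document's highest topic probability."""
    return sum(max(row) for row in doc_topic_probs) / len(doc_topic_probs)


def business_relevance(topics, keywords):
    """Share of top words that appear in the business keyword set."""
    words = [w for topic in topics for w in topic]
    return sum(w in keywords for w in words) / len(words) if words else 0.0


def lda_composite(interp, foc, rel):
    """Weighted composite: 40% interpretability, 30% focus, 30% relevance."""
    return 0.4 * interp + 0.3 * foc + 0.3 * rel


topics = [
    ["service", "good", "helpful", "staff", "quick"],
    ["service", "good", "staff", "helpful", "friendly"],
]
# Hypothetical document-topic probabilities for two documents
doc_topic_probs = [[0.8, 0.2], [0.6, 0.4]]
keywords = {"service", "helpful", "staff", "install", "problem",
            "quick", "professional", "friendly"}

interp = interpretability(topics)            # 6 unique words out of 10
foc = focus(doc_topic_probs)                 # mean of 0.8 and 0.6
rel = business_relevance(topics, keywords)   # 8 of 10 words are keywords
print(round(lda_composite(interp, foc, rel), 3))
```

Note that the toy topics score well on business relevance while scoring poorly on interpretability, which is why the composite combines the components rather than relying on one of them.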

<br>

For BERTopic models, these are the 3 metrics:

1. **Topic Quality (40% weight)**
    - Meaningful topics found / Topics requested, higher = better. For example, if I request 8 topics and the model returns 7, the quality is 7/8 = 0.875, which is high.

2. **Coverage (30% weight)**
    - % of documents assigned to meaningful topics i.e. not in outliers. For example, 16 out of 20 documents lie in 7 of the topics found and 4 go into the outlier; the coverage will be 16/20 i.e. 80% of documents are fitting well into the topics and 20% are outliers. Higher = better

3. **Business Relevance (30% weight)**
    - Same as LDA

So, the composite score for BERTopic models will be 0.4*(Topic Quality) + 0.3*(Coverage) + 0.3*(Business Relevance)
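The BERTopic metrics can be sketched the same way. The helper names and the `assignments` list below are hypothetical; the list reproduces the 16-out-of-20 coverage example with 7 topics found out of 8 requested, with -1 marking outliers as BERTopic does.

```python
def topic_quality_and_coverage(assignments, n_requested):
    """Return (topics found / topics requested, share of non-outlier docs)."""
    found = {t for t in assignments if t != -1}
    topic_quality = len(found) / n_requested
    coverage = sum(t != -1 for t in assignments) / len(assignments)
    return topic_quality, coverage


def bertopic_composite(topic_quality, coverage, relevance):
    """Weighted composite: 40% topic quality, 30% coverage, 30% relevance."""
    return 0.4 * topic_quality + 0.3 * coverage + 0.3 * relevance


# 20 hypothetical documents: 16 spread over topics 0-6, 4 outliers (-1)
assignments = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, -1, -1, -1, -1]

quality, coverage = topic_quality_and_coverage(assignments, n_requested=8)
print(quality, coverage)
# Business relevance of 0.28 is an assumed value for illustration
print(round(bertopic_composite(quality, coverage, 0.28), 3))
```

One consequence of this design: a model that collapses everything into 2 large topics gets perfect coverage but low topic quality, so neither component should be read on its own.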


In [6]:
def evaluate_lda_models(lda_results):
    """Evaluate LDA models using interpretability metrics"""
    print_section_header("LDA Models Evaluation")
    
    lda_scores = {}
    # Need some domain expertise here
    business_keywords = {'service', 'helpful', 'staff', 'install', 'problem', 'quick', 'professional', 'friendly'}
    
    for vec_name, result in lda_results.items():
        print(f"\n{vec_name}:")
        
        # Topic interpretability (word diversity)
        topic_words = result['topic_words']
        all_words = set()
        total_words = 0
        for topic in topic_words:
            top_5 = topic[:5]
            all_words.update(top_5)
            total_words += len(top_5)
        
        interpretability = len(all_words) / total_words if total_words > 0 else 0
        
        # Topic focus (document concentration)
        doc_topics = result['doc_topics']
        max_probs = np.max(doc_topics, axis=1)
        focus_score = np.mean(max_probs)
        
        # Business relevance
        business_relevant_words = 0
        for topic in topic_words:
            for word in topic[:5]:
                if word in business_keywords:
                    business_relevant_words += 1
        
        business_relevance = business_relevant_words / (len(topic_words) * 5)
        
        # Composite score
        composite = (interpretability * 0.4) + (focus_score * 0.3) + (business_relevance * 0.3)
        
        print(f"  Interpretability: {interpretability:.3f}")
        print(f"  Focus Score: {focus_score:.3f}")
        print(f"  Business Relevance: {business_relevance:.3f}")
        print(f"  COMPOSITE SCORE: {composite:.3f}")
        
        lda_scores[vec_name] = composite
    
    return lda_scores

# Evaluate LDA models
lda_scores = evaluate_lda_models(lda_results)


LDA MODELS EVALUATION

tfidf_basic:
  Interpretability: 0.700
  Focus Score: 0.665
  Business Relevance: 0.325
  COMPOSITE SCORE: 0.577

tfidf_ngrams:
  Interpretability: 0.750
  Focus Score: 0.698
  Business Relevance: 0.300
  COMPOSITE SCORE: 0.599

count_ngrams:
  Interpretability: 0.650
  Focus Score: 0.768
  Business Relevance: 0.275
  COMPOSITE SCORE: 0.573


In [7]:
def evaluate_bertopic_models(bertopic_results):
    """Evaluate BERTopic models using coverage and quality metrics"""
    print_section_header('BERTopic Models Evaluation')
    
    bertopic_scores = {}
    business_keywords = {'service', 'helpful', 'staff', 'install', 'problem', 'quick', 'professional', 'friendly'}
    
    for model_name, result in bertopic_results.items():
        print(f"\n{model_name}:")
        
        # Extract and validate topics array
        topics_array = result['topics']
        topic_model = result['model']
        
        # Convert to numpy array if needed
        if not isinstance(topics_array, np.ndarray):
            topics_array = np.array(topics_array)
        
        # Topic quality (meaningful topics found)
        unique_topics = len([t for t in set(topics_array) if t != -1])
        topic_quality = unique_topics / config.N_TOPICS
        
        # Coverage (documents assigned to meaningful topics)
        coverage = np.sum(topics_array != -1) / len(topics_array)
        
        # Business relevance (only the first 5 topic ids are scored, matching the denominator below)
        business_relevant_words = 0
        for i in range(min(unique_topics, 5)):
            if i in topic_model.get_topics():
                topic_words = topic_model.get_topic(i)
                for word, score in topic_words[:5]:
                    if word in business_keywords:
                        business_relevant_words += 1
        
        business_relevance = business_relevant_words / (min(unique_topics, 5) * 5) if unique_topics > 0 else 0
        
        # Composite score
        composite = (topic_quality * 0.4) + (coverage * 0.3) + (business_relevance * 0.3)
        
        print(f"  Topic Quality: {topic_quality:.3f}")
        print(f"  Coverage: {coverage:.3f}")
        print(f"  Business Relevance: {business_relevance:.3f}")
        print(f"  COMPOSITE SCORE: {composite:.3f}")
        
        bertopic_scores[model_name] = composite
    
    return bertopic_scores

# Evaluate BERTopic models
bertopic_scores = evaluate_bertopic_models(bertopic_results)


BERTOPIC MODELS EVALUATION

all-MiniLM-L6-v2:
  Topic Quality: 0.250
  Coverage: 1.000
  Business Relevance: 0.300
  COMPOSITE SCORE: 0.490

all-mpnet-base-v2:
  Topic Quality: 0.875
  Coverage: 0.555
  Business Relevance: 0.160
  COMPOSITE SCORE: 0.564

distilbert-base-nli:
  Topic Quality: 0.875
  Coverage: 0.682
  Business Relevance: 0.280
  COMPOSITE SCORE: 0.639

roberta-base-nli:
  Topic Quality: 0.250
  Coverage: 1.000
  Business Relevance: 0.400
  COMPOSITE SCORE: 0.520


In [8]:
def compare_all_models(lda_scores, bertopic_scores):
    """Compare all models and select best overall"""
    print_section_header("\nOverall Model Comparison")
    
    # Combine all scores
    all_scores = {}
    
    # Add LDA scores with prefix
    for model_name, score in lda_scores.items():
        all_scores[f"LDA_{model_name}"] = score
    
    # Add BERTopic scores with prefix
    for model_name, score in bertopic_scores.items():
        all_scores[f"BERTopic_{model_name}"] = score
    
    if not all_scores:
        print("No models to compare!")
        return None, None
    
    # Sort by score
    sorted_models = sorted(all_scores.items(), key=lambda x: x[1], reverse=True)
    
    print("\nModel Rankings:")
    for i, (model_name, score) in enumerate(sorted_models, 1):
        print(f"  {i}. {model_name}: {score:.3f}")
    
    # Best model
    best_model, best_score = sorted_models[0]
    
    print(f"\nBEST OVERALL MODEL: {best_model}")
    print(f"SCORE: {best_score:.3f}")
    
    return all_scores, best_model

# Compare all models
all_model_scores, best_overall = compare_all_models(lda_scores, bertopic_scores)



OVERALL MODEL COMPARISON

Model Rankings:
  1. BERTopic_distilbert-base-nli: 0.639
  2. LDA_tfidf_ngrams: 0.599
  3. LDA_tfidf_basic: 0.577
  4. LDA_count_ngrams: 0.573
  5. BERTopic_all-mpnet-base-v2: 0.564
  6. BERTopic_roberta-base-nli: 0.520
  7. BERTopic_all-MiniLM-L6-v2: 0.490

BEST OVERALL MODEL: BERTopic_distilbert-base-nli
SCORE: 0.639


Interestingly, the MPNet model, which was slower and supposed to be more powerful, scored below all three LDA models. I guess that indicates that using bigger and more complex models for this use case is overkill. The older DistilBERT model with BERTopic performed best, scoring 0.639, which is 0.040 above the second-best model (LDA with n-gram tf-idf at 0.599). The lightweight MiniLM model ranked last because it found only 2 of the 8 requested topics.

## 6. Theme Extraction with LLMs

For this, I pre-set 6 themes based on my own understanding (these may need validation from subject-matter experts again). I used 3 open-source zero-shot classification models: BART-MNLI from Facebook, which uses a Seq2Seq encoder-decoder architecture; DeBERTa-MNLI from Microsoft, which is encoder-only; and a cross-encoder DeBERTa-v3 NLI model, which encodes the comment and the candidate theme together for better contextual scoring.
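For context on how NLI models end up producing theme scores, here is a minimal sketch of the zero-shot mechanism as I understand it: each candidate theme is turned into a hypothesis sentence, the model scores how strongly the comment entails it, and in the single-label setting the entailment logits are softmaxed across themes. The logits below are invented for illustration, and `zero_shot_scores` is a hypothetical helper, not part of the `transformers` API.

```python
import math


def zero_shot_scores(entailment_logits):
    """Softmax over per-label entailment logits, returned best-first."""
    labels = list(entailment_logits)
    peak = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = [math.exp(entailment_logits[label] - peak) for label in labels]
    total = sum(exps)
    scored = zip(labels, (e / total for e in exps))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Each label becomes a hypothesis that is paired with the comment (the premise)
template = "This example is {}."
labels = ["service quality", "agent helpfulness", "process efficiency"]
hypotheses = [template.format(label) for label in labels]
print(hypotheses)

# Hypothetical entailment logits an NLI model might return for one comment
logits = {"service quality": 2.0, "agent helpfulness": 0.5, "process efficiency": -1.0}
ranked = zero_shot_scores(logits)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalised across the candidate themes, the top confidence also depends on how many themes are offered and how much they overlap in meaning.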

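I used 2 metrics: confidence score (the average probability of the top theme) and theme diversity (the number of distinct top themes found, divided by the 6 candidate themes). To calculate the composite score, I used a weighted approach with 70% confidence score and 30% diversity.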
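The weighting can be checked by hand with a short sketch. `llm_composite` is a hypothetical helper that mirrors the formula in the selection code, and the inputs are the average confidence and themes-found values from the comparison output in this section.

```python
def llm_composite(avg_confidence, n_themes_found, n_themes_total=6):
    """70% average confidence + 30% diversity (themes found / themes offered)."""
    diversity = min(n_themes_found / n_themes_total, 1.0)
    return 0.7 * avg_confidence + 0.3 * diversity


# (average confidence, distinct themes found) per model, from the notebook output
reported = {
    "BART-MNLI (Facebook)": (0.564, 2),
    "DeBERTa-MNLI (Microsoft)": (0.493, 3),
    "DeBERTa-v3-NLI": (0.566, 4),
}

composites = {
    name: round(llm_composite(conf, themes), 3)
    for name, (conf, themes) in reported.items()
}
for name, score in composites.items():
    print(f"{name}: {score}")
```

Since the confidence values are close to each other, the diversity term is what separates the models here, even though it carries only 30% of the weight.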

In [9]:
def setup_llm_classifiers():
    """Setup multiple LLM classifiers for comparison"""
    print("Setting up LLM classifiers...")
    
    # Test multiple open-source models suitable for zero-shot classification
    models_to_test = {
        'facebook/bart-large-mnli': 'BART-MNLI (Facebook)',
        'microsoft/deberta-large-mnli': 'DeBERTa-MNLI (Microsoft)',
        'cross-encoder/nli-deberta-v3-base': 'DeBERTa-v3-NLI'
    }
    
    classifiers = {}
    
    for model_name, model_desc in models_to_test.items():
        try:
            print(f"Loading {model_desc}...")
            classifier = pipeline(
                "zero-shot-classification",
                model=model_name
            )
            classifiers[model_desc] = classifier
            print(f"{model_desc} loaded successfully")
        except Exception as e:
            print(f"Failed to load {model_desc}: {e}")
            classifiers[model_desc] = None
    
    return classifiers

# Setup classifiers
llm_classifiers = setup_llm_classifiers()

Setting up LLM classifiers...
Loading BART-MNLI (Facebook)...


Device set to use mps:0


BART-MNLI (Facebook) loaded successfully
Loading DeBERTa-MNLI (Microsoft)...


Device set to use mps:0


DeBERTa-MNLI (Microsoft) loaded successfully
Loading DeBERTa-v3-NLI...


Device set to use mps:0


DeBERTa-v3-NLI loaded successfully


In [10]:
def define_theme_categories():
    """Define candidate themes for call center analysis"""
    
    candidate_labels = [
        "service quality",
        "agent helpfulness", 
        "process efficiency",
        "technical support",
        "communication clarity",
        "problem resolution"
    ]
    
    print("Theme categories defined:")
    for i, theme in enumerate(candidate_labels, 1):
        print(f"  {i}. {theme}")
    
    return candidate_labels

# Define themes
theme_categories = define_theme_categories()

Theme categories defined:
  1. service quality
  2. agent helpfulness
  3. process efficiency
  4. technical support
  5. communication clarity
  6. problem resolution


In [11]:
def compare_llm_models(classifiers, documents, candidate_labels, sample_size=15):
    """Compare performance of different LLM models"""
    print(f"\nComparing LLM models on {sample_size} documents...")
    
    # Sample documents
    np.random.seed(config.RANDOM_STATE)
    if len(documents) > sample_size:
        sample_docs = np.random.choice(documents, sample_size, replace=False)
    else:
        sample_docs = documents[:sample_size]
    
    all_results = {}
    
    for model_name, classifier in classifiers.items():
        if classifier is None:
            print(f"\nSkipping {model_name} (failed to load)")
            continue
            
        print(f"\nTesting {model_name}:")
        print("-" * 30)
        
        model_results = []
        
        for i, doc in enumerate(sample_docs[:10]):  # Only the first 10 sampled docs are tested
            if len(doc.split()) > 3:  # Skip very short comments (3 words or fewer)
                try:
                    result = classifier(doc, candidate_labels)
                    
                    model_results.append({
                        'document': doc,
                        'top_theme': result['labels'][0],
                        'confidence': result['scores'][0],
                        'all_scores': dict(zip(result['labels'], result['scores']))
                    })
                    
                    if i < 2:  # Show first 2 examples
                        print(f"  Doc {i+1}: {doc[:50]}...")
                        print(f"  Theme: {result['labels'][0]} ({result['scores'][0]:.3f})")
                        
                except Exception as e:
                    print(f"  Error on doc {i}: {e}")
                    continue
        
        # Calculate model statistics
        if model_results:
            theme_dist = {}
            confidences = []
            
            for result in model_results:
                theme = result['top_theme']
                theme_dist[theme] = theme_dist.get(theme, 0) + 1
                confidences.append(result['confidence'])
            
            avg_confidence = np.mean(confidences)
            theme_diversity = len(theme_dist)
            
            print(f"  Processed: {len(model_results)} docs")
            print(f"  Avg confidence: {avg_confidence:.3f}")
            print(f"  Themes found: {theme_diversity}")
            print(f"  Top themes: {list(theme_dist.keys())}")
            
            all_results[model_name] = {
                'results': model_results,
                'avg_confidence': avg_confidence,
                'theme_diversity': theme_diversity,
                'theme_distribution': theme_dist
            }
    
    return all_results

# Compare LLM models
llm_comparison = compare_llm_models(llm_classifiers, documents, theme_categories)


Comparing LLM models on 15 documents...

Testing BART-MNLI (Facebook):
------------------------------


  Doc 1: good value reasonable prices...
  Theme: service quality (0.479)
  Doc 2: great service friendly staff...
  Theme: service quality (0.780)




  Processed: 8 docs
  Avg confidence: 0.564
  Themes found: 2
  Top themes: ['service quality', 'agent helpfulness']

Testing DeBERTa-MNLI (Microsoft):
------------------------------
  Doc 1: good value reasonable prices...
  Theme: agent helpfulness (0.248)
  Doc 2: great service friendly staff...
  Theme: service quality (0.813)
  Processed: 8 docs
  Avg confidence: 0.493
  Themes found: 3
  Top themes: ['agent helpfulness', 'service quality', 'communication clarity']

Testing DeBERTa-v3-NLI:
------------------------------
  Doc 1: good value reasonable prices...
  Theme: process efficiency (0.274)
  Doc 2: great service friendly staff...
  Theme: service quality (0.624)
  Processed: 8 docs
  Avg confidence: 0.566
  Themes found: 4
  Top themes: ['process efficiency', 'service quality', 'communication clarity', 'technical support']


In [12]:
def select_best_llm_model(llm_comparison):
    """Select best performing LLM model"""
    print("\nLLM MODEL COMPARISON")
    print("=" * 35)
    
    if not llm_comparison:
        print("No models to compare")
        return None, None
    
    # Score models based on confidence and theme diversity
    model_scores = {}
    
    for model_name, results in llm_comparison.items():
        if results:
            # Composite score: confidence (70%) + theme diversity (30%)
            confidence_score = results['avg_confidence']
            diversity_score = min(results['theme_diversity'] / 6, 1.0)  # Normalize to max 6 themes
            
            composite_score = (confidence_score * 0.7) + (diversity_score * 0.3)
            model_scores[model_name] = composite_score
            
            print(f"{model_name}:")
            print(f"  Confidence: {confidence_score:.3f}")
            print(f"  Diversity: {diversity_score:.3f}")
            print(f"  Composite: {composite_score:.3f}")
            print()
    
    if model_scores:
        best_model = max(model_scores.keys(), key=model_scores.get)
        best_score = model_scores[best_model]
        
        print(f"BEST LLM MODEL: {best_model}")
        print(f"SCORE: {best_score:.3f}")
        
        return best_model, llm_comparison[best_model]
    
    return None, None

# Select best model
best_llm_model, best_llm_results = select_best_llm_model(llm_comparison)


LLM MODEL COMPARISON
BART-MNLI (Facebook):
  Confidence: 0.564
  Diversity: 0.333
  Composite: 0.495

DeBERTa-MNLI (Microsoft):
  Confidence: 0.493
  Diversity: 0.500
  Composite: 0.495

DeBERTa-v3-NLI:
  Confidence: 0.566
  Diversity: 0.667
  Composite: 0.596

BEST LLM MODEL: DeBERTa-v3-NLI
SCORE: 0.596


The scores are not great, probably owing to the small sample (only 8 documents were scored per model). The top model, DeBERTa-v3-NLI, gives moderately reliable classifications (average confidence 0.566) with the broadest coverage of themes: it captured 4 of the 6 themes given to it.

## 7. Group-specific Topic Analysis

In [13]:
# Get the best BERTopic model results
model_key = best_overall.replace('BERTopic_', '')
best_model = bertopic_results[model_key]['model']
topics_array = bertopic_results[model_key]['topics']

print(f"Using model: {best_overall}")
print(f"Topics array length: {len(topics_array)}")

# Create dataframe with document-topic assignments
docs_with_topics = df[df['cleaned_comment'].notna()].copy().reset_index(drop=True)
docs_with_topics['assigned_topic'] = topics_array[:len(docs_with_topics)]
docs_with_topics['meaningful_topic'] = docs_with_topics['assigned_topic'] != -1

print(f"Documents with topics: {len(docs_with_topics)}")
print(f"Meaningful topics: {docs_with_topics['meaningful_topic'].sum()}")
print(f"Outliers: {(~docs_with_topics['meaningful_topic']).sum()}")

Using model: BERTopic_distilbert-base-nli
Topics array length: 422
Documents with topics: 422
Meaningful topics: 288
Outliers: 134


### 7.1 Non-VOLT Customers

In [14]:
# Non-VOLT group analysis
non_volt_data = docs_with_topics[docs_with_topics['segment'] == 'Non-VOLT']
print(f"Non-VOLT customers: {len(non_volt_data)} documents")

# Top topics for Non-VOLT customers
non_volt_meaningful = non_volt_data[non_volt_data['assigned_topic'] != -1]
topic_counts = non_volt_meaningful['assigned_topic'].value_counts()

print(f"\nTop latent topics mentioned (Non-VOLT):")
for i, (topic_id, count) in enumerate(topic_counts.head(5).items()):
    pct = (count / len(non_volt_data)) * 100
    
    # Get topic words
    if topic_id in best_model.get_topics():
        topic_words = [word for word, score in best_model.get_topic(topic_id)[:5]]
        print(f"   Topic {topic_id}: {', '.join(topic_words)} ({count} docs, {pct:.1f}%)")
    else:
        print(f"   Topic {topic_id}: [unavailable] ({count} docs, {pct:.1f}%)")

print(f"\nNon-VOLT Summary:")
print(f"  Total documents: {len(non_volt_data)}")
print(f"  Meaningful topics: {len(non_volt_meaningful)}")
print(f"  Outliers: {len(non_volt_data) - len(non_volt_meaningful)}")

Non-VOLT customers: 252 documents

Top latent topics mentioned (Non-VOLT):
   Topic 1: helpful, friendly, polite, staff, easy (42 docs, 16.7%)
   Topic 0: company, would, call, media, told (41 docs, 16.3%)
   Topic 2: service, good, price, great, excellent (33 docs, 13.1%)
   Topic 3: everything, explained, really, polite, professional (27 docs, 10.7%)
   Topic 4: quick, fast, service, set, easy (12 docs, 4.8%)

Non-VOLT Summary:
  Total documents: 252
  Meaningful topics: 173
  Outliers: 79


In [15]:
# Non-VOLT treatment comparison
non_volt_control = non_volt_data[non_volt_data['treatment'] == 'control']
non_volt_pilot = non_volt_data[non_volt_data['treatment'] == 'pilot']

print(f"Control: {len(non_volt_control)} docs | Pilot: {len(non_volt_pilot)} docs")

# Get topic distributions
control_topics = non_volt_control[non_volt_control['assigned_topic'] != -1]['assigned_topic'].value_counts()
pilot_topics = non_volt_pilot[non_volt_pilot['assigned_topic'] != -1]['assigned_topic'].value_counts()

# Compare top topics
all_topics = set(list(control_topics.index[:3]) + list(pilot_topics.index[:3]))

print("\n2. Topic distributions differ by treatment (Non-VOLT):")
for topic_id in sorted(all_topics):
    control_count = control_topics.get(topic_id, 0)
    pilot_count = pilot_topics.get(topic_id, 0)
    
    control_pct = (control_count / len(non_volt_control)) * 100 if len(non_volt_control) > 0 else 0
    pilot_pct = (pilot_count / len(non_volt_pilot)) * 100 if len(non_volt_pilot) > 0 else 0
    diff = pilot_pct - control_pct
    
    # Get topic description
    if topic_id in best_model.get_topics():
        topic_words = [word for word, score in best_model.get_topic(topic_id)[:3]]
        topic_desc = ', '.join(topic_words)
    else:
        topic_desc = 'unknown'
    
    print(f"   Topic {topic_id} ({topic_desc}):")
    print(f"     Control: {control_pct:.1f}% | Pilot: {pilot_pct:.1f}% | Diff: {diff:+.1f}%")

Control: 132 docs | Pilot: 120 docs

2. Topic distributions differ by treatment (Non-VOLT):
   Topic 0 (company, would, call):
     Control: 15.9% | Pilot: 16.7% | Diff: +0.8%
   Topic 1 (helpful, friendly, polite):
     Control: 17.4% | Pilot: 15.8% | Diff: -1.6%
   Topic 2 (service, good, price):
     Control: 11.4% | Pilot: 15.0% | Diff: +3.6%
   Topic 3 (everything, explained, really):
     Control: 12.9% | Pilot: 8.3% | Diff: -4.5%


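* Topic 2 (service, good, price) rose from 11.4% to 15.0% (+3.6 points) in the pilot, so service and price come up more often under the new script. Topic share alone does not show whether those mentions are favourable.

* Topic 1 (helpful, friendly, polite) and Topic 3 (everything, explained, really) both fell slightly (-1.6 and -4.5 points). The output above therefore does not show an improvement in perceived agent quality or clarity for Non-VOLT customers.

* Topic 0 (company, would, call) is essentially unchanged (+0.8 points). All four differences are under 5 points, which at 132 and 120 documents amounts to roughly five comments or fewer, so none of them should be treated as a firm effect.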

### 7.2 VOLT Customers

In [16]:
# VOLT group analysis
volt_data = docs_with_topics[docs_with_topics['segment'] == 'VOLT']
print(f"VOLT customers: {len(volt_data)} documents")

# Top topics for VOLT customers
volt_meaningful = volt_data[volt_data['assigned_topic'] != -1]
topic_counts = volt_meaningful['assigned_topic'].value_counts()

print(f"\n1. Top latent topics mentioned (VOLT):")
for i, (topic_id, count) in enumerate(topic_counts.head(5).items()):
    pct = (count / len(volt_data)) * 100
    
    # Get topic words
    if topic_id in best_model.get_topics():
        topic_words = [word for word, score in best_model.get_topic(topic_id)[:5]]
        print(f"   Topic {topic_id}: {', '.join(topic_words)} ({count} docs, {pct:.1f}%)")
    else:
        print(f"   Topic {topic_id}: [unavailable] ({count} docs, {pct:.1f}%)")

print(f"\nVOLT Summary:")
print(f"  Total documents: {len(volt_data)}")
print(f"  Meaningful topics: {len(volt_meaningful)}")
print(f"  Outliers: {len(volt_data) - len(volt_meaningful)}")

VOLT customers: 170 documents

1. Top latent topics mentioned (VOLT):
   Topic 0: company, would, call, media, told (34 docs, 20.0%)
   Topic 1: helpful, friendly, polite, staff, easy (31 docs, 18.2%)
   Topic 2: service, good, price, great, excellent (17 docs, 10.0%)
   Topic 3: everything, explained, really, polite, professional (14 docs, 8.2%)
   Topic 4: quick, fast, service, set, easy (9 docs, 5.3%)

VOLT Summary:
  Total documents: 170
  Meaningful topics: 115
  Outliers: 55


In [17]:
# VOLT treatment comparison
volt_control = volt_data[volt_data['treatment'] == 'control']
volt_pilot = volt_data[volt_data['treatment'] == 'pilot']

print(f"Control: {len(volt_control)} docs | Pilot: {len(volt_pilot)} docs")

# Get topic distributions
control_topics = volt_control[volt_control['assigned_topic'] != -1]['assigned_topic'].value_counts()
pilot_topics = volt_pilot[volt_pilot['assigned_topic'] != -1]['assigned_topic'].value_counts()

# Compare top topics
all_topics = set(list(control_topics.index[:3]) + list(pilot_topics.index[:3]))

print("\n2. Topic distributions differ by treatment (VOLT):")
for topic_id in sorted(all_topics):
    control_count = control_topics.get(topic_id, 0)
    pilot_count = pilot_topics.get(topic_id, 0)
    
    control_pct = (control_count / len(volt_control)) * 100 if len(volt_control) > 0 else 0
    pilot_pct = (pilot_count / len(volt_pilot)) * 100 if len(volt_pilot) > 0 else 0
    diff = pilot_pct - control_pct
    
    # Get topic description
    if topic_id in best_model.get_topics():
        topic_words = [word for word, score in best_model.get_topic(topic_id)[:3]]
        topic_desc = ', '.join(topic_words)
    else:
        topic_desc = 'unknown'
    
    print(f"   Topic {topic_id} ({topic_desc}):")
    print(f"     Control: {control_pct:.1f}% | Pilot: {pilot_pct:.1f}% | Diff: {diff:+.1f}%")

Control: 115 docs | Pilot: 55 docs

2. Topic distributions differ by treatment (VOLT):
   Topic 0 (company, would, call):
     Control: 23.5% | Pilot: 12.7% | Diff: -10.8%
   Topic 1 (helpful, friendly, polite):
     Control: 13.9% | Pilot: 27.3% | Diff: +13.4%
   Topic 2 (service, good, price):
     Control: 12.2% | Pilot: 5.5% | Diff: -6.7%
   Topic 3 (everything, explained, really):
     Control: 6.1% | Pilot: 12.7% | Diff: +6.6%


* Topic 0 (company, would, call) dropped from 23.5% to 12.7% (-10.8 points). If this topic captures company and process issues, as its top words suggest, the new script reduced them for VOLT customers.

* Topic 1 (helpful, friendly, polite) nearly doubled from 13.9% to 27.3% (+13.4 points), and Topic 3 (everything, explained, really) rose from 6.1% to 12.7% (+6.6 points). The script appears to have improved perceived agent personality and clarity of explanation in the VOLT group. The Non-VOLT output shows no such increase: both topics fell slightly there.

* Topic 2 (service, good, price) fell from 12.2% to 5.5% (-6.7 points). With only 55 pilot documents, 5.5% is three comments, so this drop is worth monitoring rather than acting on.

**Overall Insights:**

* The script appears more effective for VOLT customers: agent personality and explanation topics rose (+13.4 and +6.6 points) while company/process mentions fell (-10.8 points).

* Non-VOLT customers show little change. Every compared topic moved by less than 5 points, with service/price mentions up slightly and explanation mentions down slightly.

* Company/process issues (Topic 0) remain among the most frequent topics in both segments, and Topic 0 is the top topic for VOLT overall (20.0%).

* Sample sizes are imbalanced for VOLT: the control group is much larger (115 vs 55). One pilot comment is worth about 1.8 percentage points, which limits the reliability of the pilot group insights.
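
To gauge whether a topic-share difference like the ones reported above could be chance, a two-proportion z-test is one option. The sketch below is not part of the original analysis: the helper name `two_proportion_z_test` is hypothetical, and the counts (16 of 115 and 15 of 55) are back-calculated from the printed percentages for VOLT Topic 1, so they are an assumption rather than values read from `docs_with_topics`.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions (pooled SE)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all hits or all misses: no difference to test
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts, back-calculated from the printed output for VOLT Topic 1:
# 13.9% of 115 control docs ~ 16, 27.3% of 55 pilot docs ~ 15
z, p_value = two_proportion_z_test(16, 115, 15, 55)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

Because four topics are compared per segment, any p-value from this check should be judged against a stricter threshold (for example 0.05 / 4 under a Bonferroni correction), and the normal approximation becomes rough when a cell holds only a few documents.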

## 8. Behavioural Metrics by Group (VOLT vs Non-VOLT)

In [18]:
# Agent personality keywords
personality_keywords = config.BEHAVIORAL_THEMES['agent_personality']
print(f"Agent personality keywords: {personality_keywords}")

# Non-VOLT customers
non_volt_personality = 0
for _, row in non_volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in personality_keywords):
            non_volt_personality += 1

non_volt_pct = (non_volt_personality / len(non_volt_data)) * 100

# VOLT customers  
volt_personality = 0
for _, row in volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in personality_keywords):
            volt_personality += 1

volt_pct = (volt_personality / len(volt_data)) * 100

print(f"\nAgent Personality Mentions:")
print(f"  Non-VOLT: {non_volt_personality}/{len(non_volt_data)} = {non_volt_pct:.1f}%")
print(f"  VOLT: {volt_personality}/{len(volt_data)} = {volt_pct:.1f}%")
print(f"  Difference: {volt_pct - non_volt_pct:+.1f}%")

Agent personality keywords: ['friendly', 'helpful', 'polite', 'lovely', 'nice', 'professional']

Agent Personality Mentions:
  Non-VOLT: 105/252 = 41.7%
  VOLT: 71/170 = 41.8%
  Difference: +0.1%


Agent personality is mentioned at almost the same rate in both groups (41.7% vs 41.8%). This suggests agents come across similarly to VOLT and Non-VOLT customers, although keyword counts measure what customers choose to mention, not how agents actually behaved.
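
One caveat on the keyword counts in this section: the check `keyword in comment_lower` is a substring match, so a comment containing "unhelpful" would count as a mention of "helpful". The sketch below compares substring and whole-word matching. The helper `count_keyword_mentions` and the sample comments are hypothetical and only illustrate the difference; whether it matters here depends on how `cleaned_comment` was preprocessed.

```python
import re


def count_keyword_mentions(comments, keywords):
    """Count comments that contain at least one keyword as a whole word."""
    if not keywords:
        return 0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    # Non-string entries (None, NaN) are skipped, like pd.notna in the cells above
    return sum(1 for c in comments if isinstance(c, str) and pattern.search(c))


# Hypothetical comments for illustration only (not taken from the survey data)
sample_comments = ["very helpful agent", "agent was unhelpful", None, "polite and nice"]
personality_keywords = ['friendly', 'helpful', 'polite', 'lovely', 'nice', 'professional']

# Substring matching, as used in the notebook cells
substring_hits = sum(
    1 for c in sample_comments
    if isinstance(c, str) and any(k in c.lower() for k in personality_keywords)
)
whole_word_hits = count_keyword_mentions(sample_comments, personality_keywords)
print(f"substring: {substring_hits}, whole word: {whole_word_hits}")
```

The same helper could replace the repeated loops for personality, clarity and reassurance by passing a different keyword list each time.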

In [19]:
# Clarity keywords
clarity_keywords = config.BEHAVIORAL_THEMES['clarity']
print(f"\nClarity keywords: {clarity_keywords}")

# Non-VOLT customers
non_volt_clarity = 0
for _, row in non_volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in clarity_keywords):
            non_volt_clarity += 1

non_volt_clarity_pct = (non_volt_clarity / len(non_volt_data)) * 100

# VOLT customers
volt_clarity = 0
for _, row in volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in clarity_keywords):
            volt_clarity += 1

volt_clarity_pct = (volt_clarity / len(volt_data)) * 100

print(f"\nClarity Mentions:")
print(f"  Non-VOLT: {non_volt_clarity}/{len(non_volt_data)} = {non_volt_clarity_pct:.1f}%")
print(f"  VOLT: {volt_clarity}/{len(volt_data)} = {volt_clarity_pct:.1f}%")
print(f"  Difference: {volt_clarity_pct - non_volt_clarity_pct:+.1f}%")


Clarity keywords: ['clear', 'explained', 'understand', 'easy', 'simple', 'straightforward']

Clarity Mentions:
  Non-VOLT: 48/252 = 19.0%
  VOLT: 31/170 = 18.2%
  Difference: -0.8%


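Clarity is mentioned at almost the same rate in both groups (19.0% vs 18.2%). The 0.8-point gap corresponds to one or two comments, which is too small to conclude that VOLT customers experience less clarity. If a gap held up on a larger sample, one possible explanation would be that VOLT customers (assuming they are the premium segment) expect more personalised and clearer service because they pay more.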

In [None]:
# Reassurance keywords
reassurance_keywords = config.BEHAVIORAL_THEMES['reassurance']
print(f"\nReassurance keywords: {reassurance_keywords}")

# Non-VOLT customers
non_volt_reassurance = 0
for _, row in non_volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in reassurance_keywords):
            non_volt_reassurance += 1

non_volt_reassurance_pct = (non_volt_reassurance / len(non_volt_data)) * 100

# VOLT customers
volt_reassurance = 0
for _, row in volt_data.iterrows():
    comment = row['cleaned_comment']
    if pd.notna(comment):
        comment_lower = comment.lower()
        if any(keyword in comment_lower for keyword in reassurance_keywords):
            volt_reassurance += 1

volt_reassurance_pct = (volt_reassurance / len(volt_data)) * 100

print(f"\nReassurance Mentions:")
print(f"  Non-VOLT: {non_volt_reassurance}/{len(non_volt_data)} = {non_volt_reassurance_pct:.1f}%")
print(f"  VOLT: {volt_reassurance}/{len(volt_data)} = {volt_reassurance_pct:.1f}%")
print(f"  Difference: {volt_reassurance_pct - non_volt_reassurance_pct:+.1f}%")


Reassurance keywords: ['reassured', 'confident', 'trust', 'reliable', 'secure', 'comfortable']

Reassurance Mentions:
  Non-VOLT: 2/252 = 0.8%
  VOLT: 4/170 = 2.4%
  Difference: +1.6%


There are almost no mentions of reassurance: just 2 in Non-VOLT and 4 in VOLT. The VOLT rate is higher in percentage terms (2.4% vs 0.8%), but with counts this small the difference could easily be chance, so it should not be read as evidence that VOLT customers are more reassured.
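
For counts as small as the reassurance mentions above (2 of 252 vs 4 of 170), Fisher's exact test is a common way to check whether two rates differ, because it does not rely on a large-sample approximation. The sketch below is an illustration, not part of the original notebook; `fisher_exact_two_sided` is a hypothetical helper written with the standard library only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric weight of the table whose top-left cell is k
        return comb(row1, k) * comb(row2, col1 - k)

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum every table that is at most as likely as the observed one
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(row1 + row2, col1)


# Rows: Non-VOLT, VOLT; columns: mentions reassurance, does not mention it
p_value = fisher_exact_two_sided(2, 250, 4, 166)
print(f"Two-sided p-value: {p_value:.3f}")
```

The weights are exact integers, so the comparison against the observed table does not need a floating-point tolerance.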

## 9. Results Compilation and Data Export

In [21]:
# Compile behavioral stats from existing calculations
behavioral_stats = {
    'agent_personality': {
        'volt_pct': volt_pct,
        'non_volt_pct': non_volt_pct,
        'difference': volt_pct - non_volt_pct
    },
    'clarity': {
        'volt_pct': volt_clarity_pct,
        'non_volt_pct': non_volt_clarity_pct,
        'difference': volt_clarity_pct - non_volt_clarity_pct
    },
    'reassurance': {
        'volt_pct': volt_reassurance_pct,
        'non_volt_pct': non_volt_reassurance_pct,
        'difference': volt_reassurance_pct - non_volt_reassurance_pct
    }
}

In [22]:
# # Compile all results for pipeline (opted out because the file would be too large; the saved parameters will be used to rebuild the model)
# topic_results = {
#     'lda_results': lda_results,
#     'bertopic_results': bertopic_results,
#     'llm_comparison': llm_comparison,
#     'behavioral_stats': behavioral_stats,
#     'best_models': {
#         'overall_best': best_overall,
#         'best_llm': best_llm_model
#     },
#     'documents': documents,
#     'docs_with_topics': docs_with_topics
# }

# Compile essential results (without heavy model objects)
# Look up the best model names once instead of recomputing them per key
best_lda_name = max(lda_scores, key=lda_scores.get)
best_bertopic_name = best_overall.replace('BERTopic_', '')

topic_results = {
    'lda_results': {
        'best_model_name': best_lda_name,
        'topic_words': lda_results[best_lda_name]['topic_words'],
        'doc_topics': lda_results[best_lda_name]['doc_topics']
    },
    'bertopic_results': {
        'best_model_name': best_bertopic_name,
        'topics_array': bertopic_results[best_bertopic_name]['topics'],
        'topic_info': bertopic_results[best_bertopic_name]['topic_info']
    },
    'behavioral_stats': behavioral_stats,
    'best_models': {
        'overall_best': best_overall,
        'best_llm': best_llm_model
    },
    'documents': documents,
    'docs_with_topics': docs_with_topics,
    'model_scores': {
        'lda_scores': lda_scores,
        'bertopic_scores': bertopic_scores
    }
}

# Save for next notebook
save_processed_data(topic_results, "topic_results.pkl")

Saved processed data: ../data/processed/topic_results.pkl


## 10. LLM Usage

**Pros of LLM:**

* **Scale:** Can process large volumes of text quickly

* **Consistency:** Applies same criteria across all documents

* **Zero-shot Classification:** No need for labeled training data

* Captured multiple relevant business themes

<br>

**Cons of LLM:**

* **Hallucination risk:** May confidently assign themes that don't actually exist

* **Black box decisions:** Cannot explain why specific themes were chosen

* Limited sample size made performance assessment unreliable

* No way to verify if classifications reflect genuine content

* Because of the small sample size, even small changes in the splits could have a significant impact on the results.

<br>

**Explainability Concerns:**

Explainability is a significant issue. When I ran the notebooks multiple times, the results often changed. With a conventional machine learning model I could set a random seed, but I found no equivalent control here. This led to the instability described in the last point of the cons above, and it is hard to explain why the model's output differs from run to run.

* Cannot trace decision path from text to theme assignment

* Business stakeholders cannot validate reasoning

* Difficult to debug incorrect classifications

* Risk of over-confidence in wrong classifications
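
One way to put a number on the run-to-run variation described above is to label the same documents in two runs and measure how often the labels agree. The sketch below is an illustration under stated assumptions: `run_agreement` is a hypothetical helper, and the theme labels are made up, not taken from the notebook's LLM output. It reports raw agreement and Cohen's kappa, which corrects agreement for what chance alone would produce.

```python
from collections import Counter


def run_agreement(labels_a, labels_b):
    """Return (raw agreement, Cohen's kappa) for two runs over the same documents."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both runs must label the same, non-empty set of documents")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each run assigned its labels independently at random
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both runs used one identical label throughout
        return observed, 1.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical theme labels from two runs over the same six comments
run_a = ['price', 'service', 'agent', 'price', 'service', 'agent']
run_b = ['price', 'service', 'agent', 'service', 'service', 'price']

agreement, kappa = run_agreement(run_a, run_b)
print(f"Agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```

A kappa close to 1 would mean the classifications are stable across runs; a low value would support the concern that individual theme assignments cannot be relied on.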

<br>

**Hallucination Risks:**

* May identify themes not present in actual customer comments

* Could create false business insights from non-existent patterns

* Especially risky with small datasets like ours

* No ground truth validation possible

<br>

**Comparison to Traditional Methods:**

* **Keyword analysis:** Fully explainable, traceable to specific words

* **Topic modeling:** Shows word clusters, interpretable themes

* **LLM classification:** Higher-level but less transparent