### Topic Modelling 

Overall Research Questions:
1) How do customers define 'good' service, and how does the new script shift those definitions?
2) What aspects of service (clarity, empathy, agent personality) drive variance in sentiment?
3) Does the new script systematically change perceptions or emotional tone, particularly for high-value segments like VOLT?

What are the best topic modelling approaches given the above?

In addition, we'll focus on the following:

- What are the top latent topics mentioned?
- How do topic distributions differ by treatment?
- What percentage of comments mention agent personality, clarity, or reassurance?

Topic modelling is most relevant to Q1, but relates to Q3 as well.
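
As a rough baseline for the third focus question (the share of comments mentioning agent personality, clarity, or reassurance), a simple keyword flag can be computed before any topic model is fitted. The sketch below is illustrative only: `THEME_KEYWORDS`, `theme_flags` and `theme_shares` are hypothetical names, and the keyword lists are assumptions rather than a validated lexicon. The example comments are taken from the samples printed in this notebook.

```python
import re

# Hypothetical keyword lists: illustrative assumptions, not a validated lexicon.
THEME_KEYWORDS = {
    "agent_personality": ["friendly", "polite", "patient", "lovely", "rude"],
    "clarity": ["clear", "clearly", "explained", "understand", "understood"],
    "reassurance": ["reassuring", "reassured", "confident", "trust"],
}

def theme_flags(comment):
    """Return {theme: bool}: does any theme keyword occur as a whole word?"""
    tokens = set(re.findall(r"[a-z']+", str(comment).lower()))
    return {theme: bool(tokens & set(words)) for theme, words in THEME_KEYWORDS.items()}

def theme_shares(comments):
    """Percentage of comments that mention each theme at least once."""
    flags = [theme_flags(c) for c in comments]
    if not flags:
        return {theme: 0.0 for theme in THEME_KEYWORDS}
    return {theme: 100 * sum(f[theme] for f in flags) / len(flags)
            for theme in THEME_KEYWORDS}

examples = [
    "Very polite engineer who consulted every step and very patient",
    "Percy explained things clearly and understood from the start what I wanted",
    "Good value reasonable prices",
    "The lady I spoke to was very helpful.",
]
shares = theme_shares(examples)
print(shares)
```

Keyword matching misses paraphrases and misspellings, so it is best read as a lower bound to compare against the embedding-based results.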

#### Load the data/packages

In [54]:

exec(open('../scripts/setup.py').read()) # load our data/packages
# Text preprocessing functions are already defined in the notebook

Main dataset loaded: (582, 15)
VOLT customers: 241
Non-VOLT customers: 341
Treatment group: 247
Control group: 335


In [55]:
df.head()

Unnamed: 0,GROUP,VOLT_FLAG,SURVEY_ID,SCORE,LTR_COMMENT,PRIMARY_REASON,MONTH,CONNECTION_TIME,SALES_PERSON_SAT,SALES_FRIENDLY_SAT,COMMINICATION_SAT,FIRST_BILL_SAT,AGENT_KNOWLEDGE,VOLT_FLAG_BINARY,TREATMENT_BINARY
45,control,,352240580,10,Good package,,2023-03-01,10,10.0,8,10,10,10,0,0
46,control,yes,351664275,10,Very good customer service,"Customer Service,General,UK Legacy",2023-03-01,10,10.0,10,10,10,10,1,0
47,control,yes,351723391,10,So far so good. Charlie was very efficient and...,,2023-03-01,10,,10,10,10,10,1,0
48,control,,351702901,10,Great communication,"Customer Service,General,UK Legacy",2023-03-01,9,10.0,10,10,10,10,0,0
49,control,yes,352243612,10,Because Chris was amazing when she contacted m...,"Customer Service,UK Legacy",2023-03-01,10,,10,10,10,10,1,0


#### Let's randomly sample some text responses to build a suitable approach

In [115]:
# randomly print a sample from non-missing LTR_COMMENT in df
sample_size = 20
sample = df[df['LTR_COMMENT'].notna()]['LTR_COMMENT'].sample(n=sample_size, random_state=46).tolist()
for i, comment in enumerate(sample, 1):
    print(f"Sample {i}: {comment}\n")

Sample 1: Very helpful staff

Sample 2: Very polite engineer who consulted every step and very patient

Sample 3: The experience is really good. The price can be a bit expensive. But I have had good experience with company media so far. The engineer, who came to my address was very professional and friendly. And did a really good job running the wires nice and tidy. Exactly how I wanted.

Sample 4: Good value reasonable prices

Sample 5: Very efficient and helpful

Sample 6: The lady I spoke to was very helpful.

Sample 7: Because it's too fast

Sample 8: Just felt treated better than with sky

Sample 9: Your customer service is awful and you really need to ditch the overseas call centers. Kindness and understanding in the voice go a long way. You can feel the contempt in operator voices (sniggering and laughing when you say you cannot understnd what they are saying - feels intentional) and cannot quite get all of the words due to very strong accents. Your only saving grace is the phys

Comments range from a few words to multi-sentence responses. Topic modelling is more nuanced for longer responses, although short responses can reflect similar (if simpler) themes.
A suitable first step is to clean the responses by removing punctuation and unusual characters.
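
To put a number on the mix of long and short responses, a quick word-count summary can help. This is a minimal sketch: `length_summary` and the `short_max_words` cut-off of 5 are assumptions introduced here for illustration, and the example comments are taken from the samples above.

```python
from statistics import median

def length_summary(comments, short_max_words=5):
    """Summarise comment lengths in whitespace-separated words."""
    counts = [len(str(c).split()) for c in comments]
    if not counts:
        raise ValueError("no comments supplied")
    n_short = sum(1 for k in counts if k <= short_max_words)
    return {
        "n": len(counts),
        "median_words": median(counts),
        "max_words": max(counts),
        "share_short": n_short / len(counts),  # share with <= short_max_words words
    }

examples = [
    "Very helpful staff",
    "Good value reasonable prices",
    "The lady I spoke to was very helpful.",
    "Very polite engineer who consulted every step and very patient",
]
summary = length_summary(examples)
print(summary)
```

On the full dataset the same idea could be applied with `df['LTR_COMMENT'].dropna().str.split().str.len().describe()`.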

In [57]:
# Add the following as a module later on

import pandas as pd
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)  
nltk.download('omw-1.4', quiet=True)

def clean_text(text, return_tokens=False, min_word_length=2, language='english'):
    """
    Clean a single text string for NLP analysis.
    
    Parameters:
    -----------
    text : str
        Input text string to clean
    return_tokens : bool, default False
        If True, returns list of tokens; if False, returns cleaned text string
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    
    Returns:
    --------
    str or list
        Cleaned text string or list of tokens
    """
    # Initialize components
    stop_words = set(stopwords.words(language))
    lemmatizer = WordNetLemmatizer()
    
    # Handle missing/empty text
    if pd.isna(text) or not str(text).strip():
        return [] if return_tokens else ''
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Remove punctuation and normalize whitespace
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize, filter stopwords and short words
    tokens = [word for word in text.split() 
              if word not in stop_words and len(word) >= min_word_length]
    
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return tokens if return_tokens else ' '.join(tokens)

def clean_text_series(series, return_tokens=False, min_word_length=2, language='english'):
    """
    Apply text cleaning to a pandas Series.
    
    Parameters:
    -----------
    series : pandas.Series
        Series containing text data
    return_tokens : bool, default False
        If True, returns list of tokens; if False, returns cleaned text string
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    
    Returns:
    --------
    pandas.Series
        Series with cleaned text
    """
    return series.apply(lambda x: clean_text(x, return_tokens, min_word_length, language))

def add_cleaned_text_columns(df, text_columns, suffix='_clean', 
                           min_word_length=2, language='english', 
                           add_tokens=True, add_word_count=True):
    """
    Add cleaned text columns to dataframe for multiple text columns.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Input dataframe
    text_columns : str or list
        Column name(s) containing text to clean
    suffix : str, default '_clean'
        Suffix for cleaned text columns
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    add_tokens : bool, default True
        Whether to add tokenized version
    add_word_count : bool, default True
        Whether to add word count column
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with additional cleaned text columns
    """
    df_result = df.copy()
    
    # Ensure text_columns is a list
    if isinstance(text_columns, str):
        text_columns = [text_columns]
    
    for col in text_columns:
        if col not in df.columns:
            continue
            
        # Add cleaned text column
        clean_col = f"{col}{suffix}"
        df_result[clean_col] = clean_text_series(
            df[col], return_tokens=False, 
            min_word_length=min_word_length, language=language
        )
        
        # Add tokenized version
        if add_tokens:
            tokens_col = f"{col}_tokens"
            df_result[tokens_col] = clean_text_series(
                df[col], return_tokens=True,
                min_word_length=min_word_length, language=language
            )
        
        # Add word count
        if add_word_count:
            count_col = f"{col}_word_count"
            df_result[count_col] = df_result[clean_col].str.split().str.len().fillna(0)
    
    return df_result

# Convenience function for common preprocessing pipeline
def preprocess_text_for_modeling(df, text_columns, min_word_length=2, 
                                language='english', filter_empty=True):
    """
    Complete preprocessing pipeline for text modeling.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Input dataframe
    text_columns : str or list
        Column name(s) containing text to clean
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    filter_empty : bool, default True
        Whether to add flag for non-empty text
        
    Returns:
    --------
    pandas.DataFrame
        Preprocessed dataframe ready for modeling
    """
    # Clean text
    df_clean = add_cleaned_text_columns(
        df, text_columns, min_word_length=min_word_length, 
        language=language, add_tokens=True, add_word_count=True
    )
    
    # Add flags for non-empty text
    if filter_empty:
        if isinstance(text_columns, str):
            text_columns = [text_columns]
        
        for col in text_columns:
            clean_col = f"{col}_clean"
            flag_col = f"has_{col}"
            df_clean[flag_col] = (df_clean[clean_col].str.len() > 0)
    
    return df_clean

In [59]:
df['LTR_COMMENT_CLEAN'] = clean_text_series(df['LTR_COMMENT'], return_tokens=False, min_word_length=2, language='english')

In [60]:
# Simple side-by-side comparison of original vs cleaned comments
sample_size = 10

# Get sample of non-missing comments
sample_df = df[df['LTR_COMMENT'].notna()].sample(n=sample_size, random_state=42)

print("=== ORIGINAL vs CLEANED COMMENTS ===\n")

for i, (idx, row) in enumerate(sample_df.iterrows(), 1):
    print(f"--- Sample {i} ---")
    print(f"ORIGINAL: {row['LTR_COMMENT']}")
    print(f"CLEANED:  {row['LTR_COMMENT_CLEAN']}")
    print("-" * 80)


=== ORIGINAL vs CLEANED COMMENTS ===

--- Sample 1 ---
ORIGINAL: Percy explained things clearly and understood from the start what I wanted. there was no hard sell. there was listening and understanding. top marks Percy and thank you
CLEANED:  percy explained thing clearly understood start wanted hard sell listening understanding top mark percy thank
--------------------------------------------------------------------------------
--- Sample 2 ---
ORIGINAL: Excellent service lovely representative Mathius
CLEANED:  excellent service lovely representative mathius
--------------------------------------------------------------------------------
--- Sample 3 ---
ORIGINAL: Very friendly staff and accommodating
CLEANED:  friendly staff accommodating
--------------------------------------------------------------------------------
--- Sample 4 ---
ORIGINAL: Very difficult to talk to anyone, very expensive, always an IT problem, too many calls, lots of mistakes took place with my name, dates leav

We apply a standard text-cleaning pipeline, as illustrated above: lowercasing, punctuation removal, stop-word removal and lemmatization. Stop words are removed because they carry little semantic information.

#### Topic Modelling with BERTopic

In [106]:
# BERTopic Implementation
import pandas as pd
import numpy as np
from bertopic import BERTopic

# Load your data (assuming df is already loaded with LTR_COMMENT_CLEAN column)
# df = pd.read_pickle('../data/processed/cleaned_call_script_data.pkl')

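# Step 1: Data preparation - drop empty and very short cleaned comments
# (clean_text returns '' for missing input, so the length filter also removes missing comments)
# Each mask is a callable so it is evaluated on the already-filtered frame, not on the original df
df_clean = (df
    .loc[lambda d: d['LTR_COMMENT_CLEAN'].notna()]  # Guard against any remaining NaN
    .loc[lambda d: d['LTR_COMMENT_CLEAN'].str.len() >= 20]  # Keep comments of at least 20 characters after cleaning
    .reset_index(drop=True)  # Reset index after filtering
    .copy()
)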

print(f"Original data shape: {df.shape}")
print(f"Clean data shape: {df_clean.shape}")
print(f"Number of comments for topic modeling: {len(df_clean)}")


Original data shape: (582, 16)
Clean data shape: (349, 16)
Number of comments for topic modeling: 349


Should we also do lemmatization and stemming? (`clean_text` above already lemmatizes; stemming is not applied.)

We need to think carefully about removing stop words. For our topic extraction to be useful, we need to remove:
1) Generic English function words
2) Additional function words that come up often in our sample
3)

We need to strike a balance in our stop-word filtering. Too much and we risk stripping out informative words; too little and we end up capturing topics that aren't particularly useful or that say little about anything concrete.
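
One way to find "additional function words that come up often in our sample" is to look at document frequency: the share of comments in which a token appears at least once. The sketch below is an assumption-laden illustration, not the method used in this notebook: `document_frequency`, `stopword_candidates` and the `min_share=0.5` cut-off are hypothetical, and the example strings are cleaned comments shown in the outputs above.

```python
from collections import Counter

def document_frequency(docs):
    """Share of documents in which each token appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # set(): count each token once per document
    n_docs = len(docs)
    return {tok: cnt / n_docs for tok, cnt in counts.items()} if n_docs else {}

def stopword_candidates(docs, min_share=0.5):
    """Tokens present in at least `min_share` of documents, sorted alphabetically."""
    freqs = document_frequency(docs)
    return sorted(tok for tok, share in freqs.items() if share >= min_share)

cleaned = [
    "excellent service lovely representative mathius",
    "good customer service",
    "amazing service job completed",
    "friendly staff accommodating",
]
candidates = stopword_candidates(cleaned, min_share=0.5)
print(candidates)
```

Tokens flagged this way are only candidates. A word such as "service" is near-universal in this corpus, but whether it is uninformative still needs a manual check against the research questions.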

In [107]:

# Step 2: Text preprocessing - custom stopword removal for short feedback
import re

# Custom stopword list for topic extraction. As written below, it:
# REMOVES: generic function words and high-frequency filler words
# REMOVES: positive and negative sentiment words (terrible, great, bad, lovely),
#          to test whether topics become more concrete without them
# KEEPS: action/domain words (helped, agent, product, resolved, call, phone)

minimal_stopwords = {
    # Core function words
    'the', 'is', 'was', 'are', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
    'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might', 'must',
    'shall', 'can', 'am', 'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 
    'her', 'us', 'them', 'my', 'your', 'his', 'its', 'our', 'their', 'this', 'that',
    'these', 'those', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
    'of', 'with', 'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before',
    'after', 'above', 'below', 'between', 'among', 'within', 'without', 'against',
    
    # Additional function words spotted in your samples
    'so', 'as', 'if', 'when', 'where', 'why', 'how', 'what', 'who', 'which', 'than',
    'then', 'now', 'here', 'there', 'get', 'got', 'go', 'went', 'come', 'came',
    'said', 'say', 'told', 'tell', 'also', 'really', 'very', 'quite', 'just', 'even',
    'still', 'yet', 'already', 'always', 'never', 'sometimes', 'often',
    
    # Negative sentiment words
    'bad', 'awful', 'terrible', 'horrible', 'shocking', 'disappointing', 'poor',
    'unhelpful', 'rude', 'difficult', 'complicated', 'slow', 'delayed', 'frustrated',
    'angry', 'upset', 'confused', 'misleading', 'expensive', 'overpriced', 'unfair',
    'unacceptable', 'inadequate', 'unsatisfactory', 'problematic', 'troublesome',
    'dishonest', 'unprofessional', 'incompetent', 'useless', 'pathetic', 'ridiculous',
    
    # Positive sentiment words
    'good', 'great', 'excellent', 'amazing', 'fantastic', 'wonderful', 'brilliant', 
    'helpful', 'friendly', 'professional', 'efficient', 'polite', 'pleasant', 
    'straightforward', 'simple', 'easy', 'quick', 'fast', 'smooth', 'satisfied', 
    'happy', 'impressed', 'pleased', 'lovely', 'nice', 'perfect', 'outstanding',
    'superb', 'awesome', 'terrific', 'exceptional', 'skilled', 'knowledgeable',
    'patient', 'understanding', 'reassuring', 'confident', 'reliable', 'trustworthy',
    
    # Other common verb forms and fragments (a few entries repeat those above, which is harmless in a set)
    'the', 'and', 'for', 'ing', 'ed', 'ly', 'able', 'having', 'being', 'doing', 
    'going', 'coming', 'getting', 'looking', 'working', 'talking', 'calling',
    'waiting', 'trying', 'running', 'using', 'taking', 'making', 'asked',
    'said', 'told', 'found', 'needed', 'wanted', 'called', 'received'
}

# Only add domain-specific terms if they appear in ALL feedback and add no distinction
# Be very conservative here - check your data first!

def preprocess_text(text):
    """Minimal preprocessing for short customer feedback"""
    if pd.isna(text):
        return ""
    
    # Convert to lowercase
    text = text.lower()
    
    # Replace punctuation with spaces, keeping only apostrophes
    text = re.sub(r'[^\w\s\']', ' ', text)  # Keep apostrophes for contractions
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    # Remove everything in minimal_stopwords (function words and sentiment words) and single characters
    words = text.split()
    words = [word for word in words if word not in minimal_stopwords and len(word) > 1]
    
    return ' '.join(words)

# Apply preprocessing
df_clean['text_processed'] = df_clean['LTR_COMMENT_CLEAN'].apply(preprocess_text)

# Filter out empty processed texts
df_clean = df_clean[df_clean['text_processed'].str.len() > 0].copy()

# Prepare final documents for topic modeling
docs = df_clean['text_processed'].tolist()

print(f"After preprocessing: {len(docs)} documents ready for topic modeling")


# Print comparison of original vs cleaned comments
print("\n=== ORIGINAL vs CLEANED COMMENTS ===\n")
for i, (idx, row) in enumerate(df_clean.sample(n=10, random_state=42).iterrows(), 1):
    print(f"--- Sample {i} ---")
    print(f"ORIGINAL: {row['LTR_COMMENT']}")
    print(f"CLEANED:  {row['text_processed']}")
    print("-" * 80)

After preprocessing: 347 documents ready for topic modeling

=== ORIGINAL vs CLEANED COMMENTS ===

--- Sample 1 ---
ORIGINAL: Great benefits but pricey
CLEANED:  benefit pricey
--------------------------------------------------------------------------------
--- Sample 2 ---
ORIGINAL: Although the initial inquiry was very smooth problems have arisen and are ongoing
CLEANED:  although initial inquiry problem arisen ongoing
--------------------------------------------------------------------------------
--- Sample 3 ---
ORIGINAL: Everything sorted easily
CLEANED:  everything sorted easily
--------------------------------------------------------------------------------
--- Sample 4 ---
ORIGINAL: Z, who guided me through the process on the phone, was the epitome of empathetic warm professionalism. If all of your customer service personnel are as good as her and your products remain competitive, I see myself becoming a long-term loyal customer.
CLEANED:  guided process phone epitome empathet

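#### Why use BERTopic?

BERTopic clusters sentence embeddings rather than raw word counts, which should make it more robust to the typos and messiness of our raw data.
Embeddings also let us compare comments and topics by vector similarity.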
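
Vector similarity between embeddings is usually measured with cosine similarity. The sketch below shows the calculation on toy three-dimensional vectors. The numbers are made up for illustration and do not come from any embedding model; they are chosen so that a hypothetical misspelt variant sits close to the original, which a real model may or may not reproduce.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings (made-up numbers, assumption only)
emb_comment = [0.90, 0.10, 0.20]       # e.g. a comment about helpful staff
emb_comment_typo = [0.85, 0.15, 0.25]  # hypothetical embedding of the same comment with a typo
emb_unrelated = [0.10, 0.90, 0.30]     # e.g. a comment about price

sim_typo = cosine_similarity(emb_comment, emb_comment_typo)
sim_unrelated = cosine_similarity(emb_comment, emb_unrelated)
print(f"typo variant: {sim_typo:.3f}, unrelated: {sim_unrelated:.3f}")
```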

In [111]:
# Step 3: Initialize BERTopic model
# BERT embeddings and clustering
topic_model = BERTopic(
    verbose=True,
    calculate_probabilities=True,  # Enable probability calculation
    nr_topics='auto'  # Merge similar topics automatically after clustering
)

In [112]:

# Step 4: Fit the model and get topics
print("Fitting BERTopic model...")
topics, probs = topic_model.fit_transform(docs)

# Step 5: Add results back to dataframe
df_clean = (df_clean
    .assign(
        topic=topics,
        topic_prob=probs.max(axis=1) if hasattr(probs, 'max') else [max(p) for p in probs]
    )
)


2025-08-03 16:46:43,442 - BERTopic - Embedding - Transforming documents to embeddings.


Fitting BERTopic model...


Batches:   0%|          | 0/11 [00:00<?, ?it/s]

2025-08-03 16:46:45,252 - BERTopic - Embedding - Completed ✓
2025-08-03 16:46:45,253 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-03 16:46:45,370 - BERTopic - Dimensionality - Completed ✓
2025-08-03 16:46:45,370 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-03 16:46:45,384 - BERTopic - Cluster - Completed ✓
2025-08-03 16:46:45,385 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-08-03 16:46:45,394 - BERTopic - Representation - Completed ✓
2025-08-03 16:46:45,395 - BERTopic - Topic reduction - Reducing number of topics
2025-08-03 16:46:45,397 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-03 16:46:45,405 - BERTopic - Representation - Completed ✓
2025-08-03 16:46:45,406 - BERTopic - Topic reduction - Reduced number of topics from 7 to 7


In [113]:

# Step 6: Display topic information (the count below includes the outlier topic, -1)
print(f"\nNumber of topics generated: {len(topic_model.get_topic_info())}")
print("\nTopic Information:")
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))



Number of topics generated: 7

Topic Information:
   Topic  Count                                     Name                                     Representation                                Representative_Docs
0     -1     59      -1_agent_everything_package_service  [agent, everything, package, service, explaine...  [agent clearly stated everything know took who...
1      0    152                 0_company_call_phone_day  [company, call, phone, day, service, broadband...  [placing order online free installation activa...
2      1     58     1_service_customer_company_stressful  [service, customer, company, stressful, covera...  [customer service, customer service, customer ...
3      2     27        2_explained_everything_well_fully  [explained, everything, well, fully, clara, ex...  [explained thing well, everything explained we...
4      3     21  3_staff_process_product_professionalism  [staff, process, product, professionalism, kne...              [service staff, staff, process st

In [114]:

# Step 7: Show the first few documents assigned to each topic
print("\nSample documents by topic:")
for topic_id in topic_info['Topic'][:5]:  # First 5 rows of topic_info
    if topic_id != -1:  # Skip outlier topic, so at most 4 topics are printed
        print(f"\n--- Topic {topic_id} ---")
        sample_docs = df_clean[df_clean['topic'] == topic_id]['LTR_COMMENT_CLEAN'].head(3)
        for i, doc in enumerate(sample_docs, 1):
            print(f"{i}. {doc[:100]}...")



Sample documents by topic:

--- Topic 0 ---
1. chris amazing contacted put detail online...
2. price bundle went day placed order...
3. simple straightforward process lot information thing remember phone...

--- Topic 1 ---
1. good customer service...
2. amazing service job completed...
3. hood customer service...

--- Topic 2 ---
1. far good charlie efficient helpful let hope continues confident...
2. nice pleasent aswell...
3. alright reasonably good...

--- Topic 3 ---
1. simple process helpful staff...
2. really friendly experienced staff...
3. competent representative guide joining companyia medium...
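
To address how topic distributions differ by treatment, the topic assignments can be cross-tabulated against `GROUP`. The sketch below uses a toy frame in place of `df_clean`; the topic values are invented and the `'treatment'` label is an assumption (only `'control'` is visible in the output above).

```python
import pandas as pd

# Toy stand-in for df_clean, which carries the 'topic' column assigned above.
# Values are invented; the 'treatment' label is an assumption.
toy = pd.DataFrame({
    "GROUP": ["control", "control", "control", "treatment", "treatment", "treatment"],
    "topic": [0, 0, 1, 0, 2, 2],
})

# Row-normalised shares: each row sums to 1, so groups of different sizes are comparable
topic_share = pd.crosstab(toy["GROUP"], toy["topic"], normalize="index")
print(topic_share.round(2))
```

With the real data, the unnormalised counts from `pd.crosstab(df_clean['GROUP'], df_clean['topic'])` could then be passed to a test such as `scipy.stats.chi2_contingency`, bearing in mind that several topics here have small counts.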


In [108]:

# Step 8: Topic reduction (if needed)
print(f"\nOriginal number of topics: {len(topic_model.get_topic_info())}")

# Reduce topics if there are too many (following guide's example)
if len(topic_model.get_topic_info()) > 20:
    print("Reducing number of topics...")
    topic_model.reduce_topics(docs, nr_topics=15)
    
    # Update topics after reduction
    topics_reduced = topic_model.topics_
    df_clean['topic_reduced'] = topics_reduced
    
    print(f"Reduced number of topics: {len(topic_model.get_topic_info())}")



Original number of topics: 7


Some things to consider:

- We could remove all sentiment-related words. Does this improve the extraction of concrete topics?
- Substantively, should our topics include sentiment-related words? What is a justifiable methodology for our three questions?

We should dedicate an entire section to the methodology around stop words. Adjectives are sometimes useful and sometimes not.

- It may be possible to score each comment continuously against every category, rather than assigning it to a single discrete category, and then apply a threshold based on, for example, cosine similarity.
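
A minimal sketch of the thresholding idea, assuming a comments-by-themes cosine-similarity matrix has already been computed from embeddings. The matrix values, the theme names and the `threshold=0.4` cut-off are all made up for illustration, and `soft_theme_assignment` is a hypothetical helper.

```python
import numpy as np

def soft_theme_assignment(sim, theme_names, threshold=0.4):
    """Flag every theme whose similarity to a comment meets the threshold.

    A comment can receive several themes, or none at all.
    """
    sim = np.asarray(sim, dtype=float)
    flags = sim >= threshold
    labels = [[name for name, hit in zip(theme_names, row) if hit] for row in flags]
    return flags, labels

themes = ["clarity", "empathy", "agent_personality"]
# Rows = comments, columns = themes (made-up similarity scores)
sim = [
    [0.62, 0.10, 0.45],
    [0.05, 0.12, 0.08],
    [0.20, 0.55, 0.41],
]
flags, labels = soft_theme_assignment(sim, themes, threshold=0.4)
theme_share = flags.mean(axis=0)  # share of comments flagged for each theme
print(labels)
print(dict(zip(themes, theme_share.round(2))))
```

The threshold controls the trade-off between coverage and precision, so it would need to be checked against a hand-labelled sample of comments.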

We need to look into fine-tuning the BERTopic model or using a different approach, because the current topics are not substantively relevant. Any tuning should target the specific segmentation questions.

We should add an LLM-based implementation, which may be more adaptive to the nuances of our data.

We should also look into entity analysis, which can extract the proper nouns and other entities within each response and pass them on to the topic/theme model. See [Google Cloud Natural Language: analyzing entities](https://cloud.google.com/natural-language/docs/analyzing-entities).