### Topic Modelling 

Overall Research Questions:
1) How do customers define 'good' service, and how does the new script shift those definitions?
2) What aspects of service (clarity, empathy, agent personality) drive variance in sentiment?
3) Does the new script systematically change perceptions or emotional tone, particularly for high-value segments like VOLT?

What are the best topic modelling approaches given the above?

Topic modelling is most relevant to Q1, but relates to Q3 as well.

#### Load the data and packages

In [None]:

# Load our data and packages by running the shared setup script
with open('../scripts/setup.py') as f:
    exec(f.read())
# Text preprocessing functions are defined in this notebook rather than loaded from a script

Main dataset loaded: (582, 15)
VOLT customers: 241
Non-VOLT customers: 341
Treatment group: 247
Control group: 335


FileNotFoundError: [Errno 2] No such file or directory: '../scripts/text_preprocessing.py'

In [15]:
df.head()

Unnamed: 0,GROUP,VOLT_FLAG,SURVEY_ID,SCORE,LTR_COMMENT,PRIMARY_REASON,MONTH,CONNECTION_TIME,SALES_PERSON_SAT,SALES_FRIENDLY_SAT,COMMINICATION_SAT,FIRST_BILL_SAT,AGENT_KNOWLEDGE,VOLT_FLAG_BINARY,TREATMENT_BINARY
45,control,,352240580,10,Good package,,2023-03-01,10,10.0,8,10,10,10,0,0
46,control,yes,351664275,10,Very good customer service,"Customer Service,General,UK Legacy",2023-03-01,10,10.0,10,10,10,10,1,0
47,control,yes,351723391,10,So far so good. Charlie was very efficient and...,,2023-03-01,10,,10,10,10,10,1,0
48,control,,351702901,10,Great communication,"Customer Service,General,UK Legacy",2023-03-01,9,10.0,10,10,10,10,0,0
49,control,yes,352243612,10,Because Chris was amazing when she contacted m...,"Customer Service,UK Legacy",2023-03-01,10,,10,10,10,10,1,0


#### Let's randomly sample some text responses to build a suitable approach

In [13]:
# randomly print a sample from non-missing LTR_COMMENT in df
sample_size = 20
sample = df[df['LTR_COMMENT'].notna()]['LTR_COMMENT'].sample(n=sample_size, random_state=45).tolist()
for i, comment in enumerate(sample, 1):
    print(f"Sample {i}: {comment}\n")

Sample 1: The gentleman who dealt us was so helpful and friendly.. he deserves a lot of credit. Very nice man..

Sample 2: Very prompt response

Sample 3: Good communication and very helpful

Sample 4: Your customer service is awful and you really need to ditch the overseas call centers. Kindness and understanding in the voice go a long way. You can feel the contempt in operator voices (sniggering and laughing when you say you cannot understnd what they are saying - feels intentional) and cannot quite get all of the words due to very strong accents. Your only saving grace is the physical connection and stability of that. Thankfully, because connection quality is good I rarely (if ever - Thank Goodness) have to contact your call centers. Would not touch with a barge pole otherwise. Take a look at Now Broadband whos operators are absolutely marvellous. Lessons learned or brushed under the carpet? Haha

Sample 5: Excellent customer service

Sample 6: Simple process with very helpful staff

Comments range from two or three words to multi-sentence responses. Longer responses can carry several distinct themes, so topic modelling has more to work with there; short responses tend to reflect the same themes in a simpler form.
A suitable first step is to clean the responses: lowercase the text, remove punctuation and unusual characters, drop stopwords, and lemmatise the remaining words.
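
To put numbers on the mix of long and short responses, one option is to band comments by word count before modelling. The sketch below is a minimal illustration and not part of the original analysis: the `band_by_length` helper and the 5-word threshold are assumptions chosen for the example, and the sample strings are copied from the printed samples above.

```python
def band_by_length(comments, short_max_words=5):
    """Label each comment 'short' or 'long' by whitespace-delimited word count.

    short_max_words is a hypothetical threshold; tune it against the real data.
    """
    banded = []
    for comment in comments:
        n_words = len(str(comment).split())
        band = "short" if n_words <= short_max_words else "long"
        banded.append((n_words, band))
    return banded


# Sample comments copied from the notebook output
sample_comments = [
    "Very prompt response",
    "Good communication and very helpful",
    "Simple process with very helpful staff",
    "The gentleman who dealt us was so helpful and friendly.. "
    "he deserves a lot of credit. Very nice man..",
]

bands = band_by_length(sample_comments)
for comment, (n_words, band) in zip(sample_comments, bands):
    print(f"{n_words:>3} words | {band:<5} | {comment[:40]}")
```

On the full dataset the same helper could be applied to `df['LTR_COMMENT'].dropna()`, which would allow topics to be compared within each length band.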

In [None]:
# Add the following as a module later on

import re
from functools import lru_cache

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Build the lemmatizer once rather than on every clean_text call
_LEMMATIZER = WordNetLemmatizer()

@lru_cache(maxsize=None)
def _get_stop_words(language):
    """Return the stopword set for a language, cached after the first lookup."""
    return frozenset(stopwords.words(language))

def clean_text(text, return_tokens=False, min_word_length=2, language='english'):
    """
    Clean a single text string for NLP analysis.

    Parameters:
    -----------
    text : str
        Input text string to clean
    return_tokens : bool, default False
        If True, returns list of tokens; if False, returns cleaned text string
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords

    Returns:
    --------
    str or list
        Cleaned text string or list of tokens
    """
    # Reuse the cached stopword set and the shared lemmatizer
    stop_words = _get_stop_words(language)
    lemmatizer = _LEMMATIZER
    
    # Handle missing/empty text
    if pd.isna(text) or not str(text).strip():
        return [] if return_tokens else ''
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Remove punctuation and normalize whitespace
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Tokenize, filter stopwords and short words
    tokens = [word for word in text.split() 
              if word not in stop_words and len(word) >= min_word_length]
    
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    return tokens if return_tokens else ' '.join(tokens)

def clean_text_series(series, return_tokens=False, min_word_length=2, language='english'):
    """
    Apply text cleaning to a pandas Series.
    
    Parameters:
    -----------
    series : pandas.Series
        Series containing text data
    return_tokens : bool, default False
        If True, returns list of tokens; if False, returns cleaned text string
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    
    Returns:
    --------
    pandas.Series
        Series with cleaned text
    """
    return series.apply(lambda x: clean_text(x, return_tokens, min_word_length, language))

def add_cleaned_text_columns(df, text_columns, suffix='_clean', 
                           min_word_length=2, language='english', 
                           add_tokens=True, add_word_count=True):
    """
    Add cleaned text columns to dataframe for multiple text columns.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Input dataframe
    text_columns : str or list
        Column name(s) containing text to clean
    suffix : str, default '_clean'
        Suffix for cleaned text columns
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    add_tokens : bool, default True
        Whether to add tokenized version
    add_word_count : bool, default True
        Whether to add word count column
        
    Returns:
    --------
    pandas.DataFrame
        Dataframe with additional cleaned text columns
    """
    df_result = df.copy()
    
    # Ensure text_columns is a list
    if isinstance(text_columns, str):
        text_columns = [text_columns]
    
    for col in text_columns:
        if col not in df.columns:
            continue
            
        # Add cleaned text column
        clean_col = f"{col}{suffix}"
        df_result[clean_col] = clean_text_series(
            df[col], return_tokens=False, 
            min_word_length=min_word_length, language=language
        )
        
        # Add tokenized version
        if add_tokens:
            tokens_col = f"{col}_tokens"
            df_result[tokens_col] = clean_text_series(
                df[col], return_tokens=True,
                min_word_length=min_word_length, language=language
            )
        
        # Add word count
        if add_word_count:
            count_col = f"{col}_word_count"
            df_result[count_col] = df_result[clean_col].str.split().str.len().fillna(0)
    
    return df_result

# Convenience function for common preprocessing pipeline
def preprocess_text_for_modeling(df, text_columns, min_word_length=2, 
                                language='english', filter_empty=True):
    """
    Complete preprocessing pipeline for text modeling.
    
    Parameters:
    -----------
    df : pandas.DataFrame
        Input dataframe
    text_columns : str or list
        Column name(s) containing text to clean
    min_word_length : int, default 2
        Minimum word length to keep
    language : str, default 'english'
        Language for stopwords
    filter_empty : bool, default True
        Whether to add flag for non-empty text
        
    Returns:
    --------
    pandas.DataFrame
        Preprocessed dataframe ready for modeling
    """
    # Clean text
    df_clean = add_cleaned_text_columns(
        df, text_columns, min_word_length=min_word_length, 
        language=language, add_tokens=True, add_word_count=True
    )
    
    # Add flags for non-empty text
    if filter_empty:
        if isinstance(text_columns, str):
            text_columns = [text_columns]
        
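        for col in text_columns:
            clean_col = f"{col}_clean"
            # Columns missing from df are skipped upstream, so skip them here too
            if clean_col not in df_clean.columns:
                continue
            flag_col = f"has_{col}"
            df_clean[flag_col] = (df_clean[clean_col].str.len() > 0)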
    
    return df_clean
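
To see what each stage of `clean_text` does to a comment, the stages can be traced by hand. The sketch below is a stand-in for illustration rather than the function above: it reuses the same two regular expressions, but `DEMO_STOP_WORDS` is a tiny hypothetical stopword set (not NLTK's English list), `trace_cleaning` is a made-up helper name, and lemmatisation is left out so that the example runs without any NLTK downloads.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumption for this demo)
DEMO_STOP_WORDS = {"very", "and", "the", "was", "so", "a", "of", "he", "who", "us"}


def trace_cleaning(text, stop_words=DEMO_STOP_WORDS, min_word_length=2):
    """Return the intermediate result of each cleaning stage as a dict."""
    lowered = text.lower()
    # Replace anything that is not a word character or whitespace with a space
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace left behind by the substitution
    normalised = re.sub(r"\s+", " ", no_punct).strip()
    # Drop stopwords and tokens shorter than min_word_length
    tokens = [
        word for word in normalised.split()
        if word not in stop_words and len(word) >= min_word_length
    ]
    return {
        "lowered": lowered,
        "normalised": normalised,
        "tokens": tokens,
    }


stages = trace_cleaning("Very friendly staff and accommodating")
for stage_name, value in stages.items():
    print(f"{stage_name:<10}: {value}")
```

The real function additionally lemmatises each token (for example, "things" becomes "thing" in the cleaned output), so results on the full data will differ from this trace wherever NLTK's stopword list or lemmatiser applies.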

In [32]:
# Simple side-by-side comparison of original vs cleaned comments
sample_size = 10

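# Build the cleaned column if it is missing (assumed step: the notebook does not
# show where LTR_COMMENT_CLEAN is created)
if 'LTR_COMMENT_CLEAN' not in df.columns:
    df['LTR_COMMENT_CLEAN'] = clean_text_series(df['LTR_COMMENT'])

# Get sample of non-missing comments
sample_df = df[df['LTR_COMMENT'].notna()].sample(n=sample_size, random_state=42)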

print("=== ORIGINAL vs CLEANED COMMENTS ===\n")

for i, (idx, row) in enumerate(sample_df.iterrows(), 1):
    print(f"--- Sample {i} ---")
    print(f"ORIGINAL: {row['LTR_COMMENT']}")
    print(f"CLEANED:  {row['LTR_COMMENT_CLEAN']}")
    print("-" * 80)


=== ORIGINAL vs CLEANED COMMENTS ===

--- Sample 1 ---
ORIGINAL: Percy explained things clearly and understood from the start what I wanted. there was no hard sell. there was listening and understanding. top marks Percy and thank you
CLEANED:  percy explained thing clearly understood start wanted hard sell listening understanding top mark percy thank
--------------------------------------------------------------------------------
--- Sample 2 ---
ORIGINAL: Excellent service lovely representative Mathius
CLEANED:  excellent service lovely representative mathius
--------------------------------------------------------------------------------
--- Sample 3 ---
ORIGINAL: Very friendly staff and accommodating
CLEANED:  friendly staff accommodating
--------------------------------------------------------------------------------
--- Sample 4 ---
ORIGINAL: Very difficult to talk to anyone, very expensive, always an IT problem, too many calls, lots of mistakes took place with my name, dates leav

Unnamed: 0,GROUP,VOLT_FLAG,SURVEY_ID,SCORE,LTR_COMMENT,PRIMARY_REASON,MONTH,CONNECTION_TIME,SALES_PERSON_SAT,SALES_FRIENDLY_SAT,COMMINICATION_SAT,FIRST_BILL_SAT,AGENT_KNOWLEDGE,VOLT_FLAG_BINARY,TREATMENT_BINARY,LTR_COMMENT_CLEAN
45,control,,352240580,10,Good package,,2023-03-01,10,10.0,8,10,10,10,0,0,good package
46,control,yes,351664275,10,Very good customer service,"Customer Service,General,UK Legacy",2023-03-01,10,10.0,10,10,10,10,1,0,good customer service
47,control,yes,351723391,10,So far so good. Charlie was very efficient and...,,2023-03-01,10,,10,10,10,10,1,0,far good charlie efficient helpful let hope co...
48,control,,351702901,10,Great communication,"Customer Service,General,UK Legacy",2023-03-01,9,10.0,10,10,10,10,0,0,great communication
49,control,yes,352243612,10,Because Chris was amazing when she contacted m...,"Customer Service,UK Legacy",2023-03-01,10,,10,10,10,10,1,0,chris amazing contacted put detail online
