#### Comment pre-processing

For both topic extraction and sentiment analysis modelling, it is useful to first correct typos, irregular characters, and other anomalies in the raw comment text.

At this stage, we should keep punctuation, capitalization, informal language, misspellings, and emotional expressions. However, we do want to correct things like:
- Character encoding issues: e.g. "wÃ¡s" → "was", "I don¬"t know much" → "I don't know much"
- Unicode encoding artifacts
- Excessive whitespace
- Comments that are literally 'na' or some variation
- Typos

Following this step, we can be confident in the integrity of our text data before any further processing tailored to topic modelling or sentiment analysis.

First, we'll apply function-based methods to address the Unicode artifacts, character encoding issues, and 'na' responses.

Next, we can rely on an LLM to fix basic typos, making sure to check any changed responses. At scale, a lightweight model would be used; for our small sample, we can do a many-shot run with a local model such as deepseek-r1 and review every change. That said, both topic modelling and sentiment analysis models are often robust to typos (often through the use of embedding models), so this step might not be necessary. (UPDATED: this step was deemed too time-inefficient to carry out.)
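Checking changed responses could be done with a word-level diff between each original comment and its corrected version. The sketch below is only an illustration using the standard library's `difflib`; the helper name `changed_words` is hypothetical and is not part of this notebook's pipeline.

```python
import difflib

def changed_words(original, corrected):
    """Hypothetical helper: list the (before, after) word groups that differ."""
    before, after = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != 'equal':
            # Record replaced, deleted, or inserted words for manual review
            changes.append((' '.join(before[i1:i2]), ' '.join(after[j1:j2])))
    return changes

original = "talk to a human person is really hhard"
corrected = "talk to a human person is really hard"
print(changed_words(original, corrected))  # [('hhard', 'hard')]
```

An empty list means the correction left the comment untouched, so only comments with a non-empty result would need a manual look.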

In [18]:
exec(open('../scripts/setup.py').read()) # load our data/packages

Main dataset loaded: (582, 15)


In [19]:
# randomly print a sample from non-missing LTR_COMMENT in df
sample_size = 70
sample = df[df['LTR_COMMENT'].notna()]['LTR_COMMENT'].sample(n=sample_size, random_state=47).tolist()
for i, comment in enumerate(sample, 1):
    print(f"Sample {i}: {comment}\n")

Sample 1: Very hardworking and punctual quits who carried out the work

Sample 2: I got a good offer with company than sky

Sample 3: Is a good service but trying to get to talk to a human person is really hhard , the people you use do not use very good english therefore were not easy to communicate with

Sample 4: Very polite and informative

Sample 5: Cut my Internet off . Asked for router to be sent after 5.00 tracker shows between 10 and 12

Sample 6: The gentleman who dealt us was so helpful and friendly.. he deserves a lot of credit. Very nice man..

Sample 7: you are very efficient=====will you be moving sat dish when you install

Sample 8: How easy it was to set up.

Sample 9: Very adaptable and patient/ helpful

Sample 10: I've not agreed to any of this yet I'm bombarded with emails that you cannot reply to and setting up direct debits without my knowledge

Sample 11: They are efficient

Sample 12: Extremely disappointed with your customer service team. I am still awaiting a r

#### Step 1: Applying rule-based corrections

In [20]:
import re

def fix_encoding_issues(text):
    # Check if text is NaN or None first
    if pd.isna(text) or text is None:
        return None
    
    # Fix common encoding problems
    text = text.replace('¬"', "'")  # Samples 14, 15, 34, 52, 62, 64, 67
    text = text.replace('Ã¡', 'a')
    text = text.replace('Ã©', 'e')
    return text

def handle_incomplete_responses(text):
    # Check if text is NaN or None first
    if pd.isna(text) or text is None:
        return None
    
    # Handle non-informative placeholder responses (matched case-insensitively,
    # so variants such as 'NA', 'N/a' or 'NONE' are caught as well)
    placeholder_responses = {'na', 'n/a', 'n.a.', 'none', 'no comment'}
    if text.strip().lower() in placeholder_responses:
        return None  # Replace with None for missing data
    return text

def handle_very_short_responses(text):
    # Check if text is NaN or None first
    if pd.isna(text) or text is None:
        return None
    
    # Handle very short responses that are not informative
    if len(text.strip()) < 3:  # Adjust threshold as needed
        return None  # Use None for missing object data
    return text

def normalize_spacing(text):
    # Check if text is NaN or None first
    if pd.isna(text) or text is None:
        return None
    
    # Handle excessive punctuation/spacing
    text = re.sub(r'=+', ' ', text)  # Sample 7: "efficient====="
    text = re.sub(r'\s+', ' ', text)  # Multiple spaces
    text = text.strip()
    return text

def preprocess_telecom_feedback(text):
    # Check if text is NaN or None first
    if pd.isna(text) or text is None:
        return None
    
    # Convert to string if it's not already (handles edge cases)
    text = str(text)
    
    # Apply all preprocessing steps
    text = fix_encoding_issues(text)
    if text is None:
        return None
        
    text = handle_incomplete_responses(text)
    if text is None:
        return None
        
    text = handle_very_short_responses(text)
    if text is None:
        return None
        
    text = normalize_spacing(text)
    
    return text
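
The hard-coded replacements in `fix_encoding_issues` cover the patterns seen in this sample. Sequences such as 'Ã¡' are what the UTF-8 bytes of 'á' look like when they are decoded as Latin-1, so a more general alternative is to reverse that round trip. The sketch below is an assumption about how this could be done, not what the notebook runs; `repair_mojibake` is a hypothetical name. Note that it restores the accented character ('wás') rather than the unaccented 'was' produced above, and it does not handle the '¬"' pattern.

```python
def repair_mojibake(text):
    """Hypothetical helper: undo UTF-8 text that was mis-decoded as Latin-1."""
    try:
        # Re-encode to the bytes that were originally read, then decode them as UTF-8
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a clean Latin-1/UTF-8 mix-up, so leave the text untouched
        return text

# How the artifact arises: the UTF-8 bytes of 'á' read back as Latin-1
garbled = 'á'.encode('utf-8').decode('latin-1')  # 'Ã¡'

fixed = repair_mojibake('wÃ¡s')       # 'wás'
untouched = repair_mojibake('café')   # 'café' (already valid, returned as-is)
```

Falling back to the original text on any encoding error keeps the helper safe to apply to every comment, including those that contain legitimate accented characters.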


In [21]:
# Apply the full cleaning pipeline (encoding fixes, 'na' handling,
# short-response handling, spacing) to the DataFrame
df['LTR_COMMENT_CLEAN'] = df['LTR_COMMENT'].apply(preprocess_telecom_feedback)

# Print the same sample as before, with each original comment next to its cleaned version
for i, comment in enumerate(sample, 1):
    cleaned_comment = df[df['LTR_COMMENT'] == comment]['LTR_COMMENT_CLEAN'].values[0]
    print(f"Original Sample {i}: {comment}\nCleaned Sample {i}: {cleaned_comment}\n")

Original Sample 1: Very hardworking and punctual quits who carried out the work
Cleaned Sample 1: Very hardworking and punctual quits who carried out the work

Original Sample 2: I got a good offer with company than sky
Cleaned Sample 2: I got a good offer with company than sky

Original Sample 3: Is a good service but trying to get to talk to a human person is really hhard , the people you use do not use very good english therefore were not easy to communicate with
Cleaned Sample 3: Is a good service but trying to get to talk to a human person is really hhard , the people you use do not use very good english therefore were not easy to communicate with

Original Sample 4: Very polite and informative
Cleaned Sample 4: Very polite and informative

Original Sample 5: Cut my Internet off . Asked for router to be sent after 5.00 tracker shows between 10 and 12
Cleaned Sample 5: Cut my Internet off . Asked for router to be sent after 5.00 tracker shows between 10 and 12

Original Sample 6: T
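
Printing pairs works for a small sample, but for the full column a summary count may be easier to scan. The following is a minimal sketch under that assumption; `audit_cleaning` and its category names are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def _is_missing(value):
    # Treat both None and float NaN (pandas' missing marker) as missing
    return value is None or (isinstance(value, float) and math.isnan(value))

def audit_cleaning(originals, cleaned):
    """Hypothetical helper: count what the cleaning step did to each comment."""
    counts = Counter()
    for before, after in zip(originals, cleaned):
        if _is_missing(before):
            counts['missing_in_source'] += 1
        elif _is_missing(after):
            counts['dropped'] += 1      # e.g. 'na' or a very short response
        elif before == after:
            counts['unchanged'] += 1
        else:
            counts['changed'] += 1      # e.g. spacing or encoding fixes
    return dict(counts)

originals = ["you are very efficient=====will you", "na", None, "Very polite"]
cleaned = ["you are very efficient will you", None, None, "Very polite"]
summary = audit_cleaning(originals, cleaned)
```

With the notebook's data, the call would presumably be `audit_cleaning(df['LTR_COMMENT'], df['LTR_COMMENT_CLEAN'])`, since iterating over a pandas Series yields its values.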

#### Step 2: Using a spell-checker to correct typos

We tried correcting typos with both a local LLM (deepseek-r1) and pyspellchecker. Both introduced further errors: they either miscorrected certain words or randomly added non-useful strings of text.

The gains from correcting typos are not worth the time investment, so we'll proceed with them still in the data and choose models that are robust to typos for topic modelling and sentiment analysis.

In [22]:
# Save our adjusted data
df.to_pickle('../data/processed/cleaned_call_script_data.pkl')
df.to_csv('../data/processed/cleaned_call_script_data.csv', index=False)