DaveAhearne/NLPAnalysis

Applying NLP for Topic Modelling in a Real-Life Context

A topic project completed as part of the Cambridge Data Science Career Accelerator (C301, Weeks 4 & 5).

The project applies a multi-method NLP pipeline to two independent PureGym review datasets — Google Reviews and TrustPilot — to identify customer pain points and generate actionable improvement recommendations.

Problem Statement

Despite strong brand performance, PureGym faces unresolved customer issues that risk damaging retention and revenue. Manually reviewing large volumes of customer feedback is impractical. Applying NLP can systematically surface the root causes of dissatisfaction, enabling targeted fixes that are more likely to improve customer satisfaction and reduce reliance on discounting as a retention strategy.

Datasets

Two review datasets covering 12 months of PureGym feedback:

| Dataset | Unique Locations | Notes |
|---|---|---|
| Google | 512 | Scored 1–5 via Overall Score |
| TrustPilot | 377 | Scored 1–5 via Stars |

310 locations were common to both datasets. In both sources, a review was treated as negative if its score was below 3 (that is, 1 or 2).
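The negative-review rule and the location overlap can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the record layout, the placeholder gym names, and the helper names are all assumptions.

```python
# Hypothetical records as (location, score) pairs; names are placeholders.
google = [("Gym A", 1), ("Gym B", 5), ("Gym C", 2)]
trustpilot = [("Gym A", 2), ("Gym C", 4), ("Gym D", 3)]

NEGATIVE_BELOW = 3  # scores of 1 or 2 count as negative


def negative_reviews(records):
    """Keep only records whose score is strictly below the threshold."""
    return [(loc, score) for loc, score in records if score < NEGATIVE_BELOW]


def common_locations(a, b):
    """Return the set of locations that appear in both sources."""
    return {loc for loc, _ in a} & {loc for loc, _ in b}
```

Note that a score of exactly 3 is not counted as negative under this rule.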

Language detection was performed using langdetect to obtain appropriate stopword lists. Non-English and unknown language reviews (including short text flagged as Welsh) were excluded from analysis.
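A rough sketch of the language filter, assuming langdetect's `detect` function is used per review and that detection failures are mapped to an "unknown" label (the helper names and the seed value are illustrative, not taken from the notebook):

```python
def make_langdetect_detector(seed=0):
    """Build a detector backed by langdetect (imported lazily)."""
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic unless the factory seed is fixed.
    DetectorFactory.seed = seed

    def detect_language(text):
        try:
            return detect(text)  # ISO 639-1 code such as "en" or "cy"
        except LangDetectException:
            return "unknown"  # raised for empty or feature-less text

    return detect_language


def keep_english(reviews, detect_language):
    """Drop every review that is not detected as English."""
    return [text for text in reviews if detect_language(text) == "en"]
```

Typical usage would be `keep_english(reviews, make_langdetect_detector())`. Very short reviews are the most likely to be mislabelled, which is consistent with the Welsh false positives noted above.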

Methods

The analysis was structured in progressive stages, from simple statistical exploration through to semantic modelling and LLM-assisted insight generation.

1. Basic Text Analysis

Text was lowercased, tokenised using NLTK's word_tokenize, and stopwords were removed. Frequency distributions and word clouds were generated for all reviews and negative reviews separately, across both datasets.
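The preprocessing steps above could look roughly like the following. The punctuation filter (`isalpha`) and the function names are assumptions added for illustration; the NLTK resources must be downloaded beforehand (for example `nltk.download("punkt")` and `nltk.download("stopwords")`).

```python
from collections import Counter


def nltk_resources():
    """Return NLTK's tokeniser and English stopword set (lazy import)."""
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    return word_tokenize, set(stopwords.words("english"))


def term_frequencies(reviews, tokenize, stop_words):
    """Lowercase, tokenise, drop stopwords, and count the remaining terms."""
    counts = Counter()
    for text in reviews:
        tokens = tokenize(text.lower())
        # Assumption: tokens that are not purely alphabetic are discarded.
        counts.update(t for t in tokens if t.isalpha() and t not in stop_words)
    return counts
```

With NLTK installed, `term_frequencies(reviews, *nltk_resources()).most_common(20)` gives the kind of frequency list that feeds a word cloud.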

Both datasets showed a broadly positive tone overall, with high-frequency terms such as "equipment", "great", "good", "clean", and "friendly". Isolating negative reviews shifted vocabulary toward "equipment", "staff", "machines", and "membership", providing initial signal but without directional context — word frequency alone cannot capture sentiment.

2. BERTopic — Full Dataset

BERTopic was applied to the combined review corpus using default parameters and then with hyperparameter tuning. The basic model produced a large number of small, fragmented topics with a high proportion of uncategorised reviews (topic -1: 7,809). After tuning, the uncategorised count reduced to 6,204, topic counts grew, and the intertopic distance map showed clearer separation.
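As a hedged sketch of how the default and tuned runs might be compared: the README does not say which hyperparameters were tuned, so `min_topic_size` is used here purely as an example, and the helper names are hypothetical.

```python
def fit_bertopic(docs, min_topic_size=10):
    """Fit BERTopic on a list of strings (lazy import; 10 is the library default)."""
    from bertopic import BERTopic

    topic_model = BERTopic(min_topic_size=min_topic_size)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def outlier_count(topics):
    """Count documents assigned to topic -1, BERTopic's outlier bucket."""
    return sum(1 for topic in topics if topic == -1)
```

Calling `fit_bertopic(docs)` and then `fit_bertopic(docs, min_topic_size=30)` (an illustrative value) and comparing `outlier_count` on each result is one way to reproduce the kind of before/after comparison reported above.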

The top 10 topics from the tuned model were labelled using an LLM (Falcon-7b-instruct), producing the following themes:

| Theme | Summary |
|---|---|
| Equipment | Variety and availability of weights, machines, and benches |
| Facilities | Temperature and comfort of showers and changing rooms |
| Hygiene | Cleanliness and odour of toilets and changing rooms |
| Parking | Availability and convenience of free parking |
| Space | Ample equipment areas but crowding concerns |
| Climate | Air conditioning, saunas, temperature control |
| Membership | Joining processes, fees, codes, and account activation |
| Classes | Availability and quality of workout classes |

A similarity matrix revealed overlap between equipment-related topics, suggesting shared content across those clusters.
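For illustration, a topic similarity matrix is commonly built from pairwise cosine similarity between topic embeddings. The sketch below assumes that approach and uses plain lists so it stays dependency-free; in BERTopic the vectors would typically come from `topic_model.topic_embeddings_`, and `visualize_heatmap()` draws a comparable matrix. The README does not state how its matrix was produced.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity; assumes every vector is non-zero."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [
            sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
            for v, norm_v in zip(vectors, norms)
        ]
        for u, norm_u in zip(vectors, norms)
    ]
```

Off-diagonal values close to 1 indicate topics that share much of their content, as seen with the equipment-related clusters.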

3. Location-Specific Analysis

Locations were ranked by negative review count within each dataset, and the two rankings were combined into a cross-source ranking. The five worst-performing locations by total negative review count were:

| Location | Google | TrustPilot | Total |
|---|---|---|---|
| London Stratford | 59 | 22 | 81 |
| London Enfield | 25 | 23 | 48 |
| London Swiss Cottage | 24 | 15 | 39 |
| London Seven Sisters | 18 | 16 | 34 |
| London Bermondsey | 16 | 18 | 34 |
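The cross-source totals above can be reproduced by summing per-location counts. This is a minimal sketch using the figures from the table; the variable and function names are illustrative, and the notebook may have done this with pandas instead.

```python
from collections import Counter

# Negative review counts per location, taken from the table above.
google_negative = Counter({
    "London Stratford": 59,
    "London Enfield": 25,
    "London Swiss Cottage": 24,
    "London Seven Sisters": 18,
    "London Bermondsey": 16,
})
trustpilot_negative = Counter({
    "London Stratford": 22,
    "London Enfield": 23,
    "London Swiss Cottage": 15,
    "London Seven Sisters": 16,
    "London Bermondsey": 18,
})


def combined_ranking(a, b, top_n=5):
    """Sum counts across both sources and return the highest totals first."""
    return (a + b).most_common(top_n)
```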

High London representation reflects higher overall review volume rather than necessarily worse performance. Word frequency analysis on the top 30 worst-performing locations showed that positive terms like "clean" and "friendly" disappeared compared to the full dataset, though vocabulary shifts were limited without sentiment context.

BERTopic applied to this location subset produced fewer but larger, better-separated topics with clearer LLM-generated summaries, including location-specific issues such as maintenance problems, parking fines, and staff complaints.

4. Emotion Analysis

BERT-based emotion classification was applied to assign one of six emotions (anger, fear, joy, love, sadness, surprise) to each review. The model has no "unclassified" label, which introduces a joy bias — short or ambiguous negative reviews were frequently misclassified as joy. This was confirmed by manual inspection.

Despite this limitation, filtering to angry reviews produced cleaner and more consistent topic signals, with fewer uncategorised reviews and more clearly separated clusters in BERTopic. Anger and sadness dominated negative review emotion distributions, aligning with expectations.
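A sketch of the emotion filter using the Hugging Face `transformers` pipeline. The README does not name the classifier; the checkpoint below is an assumption, chosen only because it emits the same six labels. The review dictionary layout and helper names are also hypothetical.

```python
# Assumed checkpoint; the project may have used a different six-emotion model.
ASSUMED_MODEL = "bhadresh-savani/bert-base-uncased-emotion"


def make_emotion_classifier(model_name=ASSUMED_MODEL):
    """Return a function mapping a list of texts to one emotion label each."""
    from transformers import pipeline  # lazy import: heavy dependency

    classifier = pipeline("text-classification", model=model_name)

    def classify(texts):
        # truncation=True guards against reviews longer than the model's limit.
        results = classifier(list(texts), truncation=True)
        return [result["label"] for result in results]

    return classify


def angry_negative(reviews, classify, negative_below=3):
    """Keep reviews labelled 'anger' whose score is below the threshold."""
    labels = classify([review["text"] for review in reviews])
    return [
        review
        for review, label in zip(reviews, labels)
        if label == "anger" and review["score"] < negative_below
    ]
```

Because the classifier always returns one of the six labels, combining the emotion label with the review score is a simple guard against low-scoring reviews that were mislabelled as joy.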

5. BERTopic on Negative Angry Reviews

Restricting the corpus to reviews classified as angry and paired with low scores produced the most actionable topic set, with roughly 40 topics and strong cluster separation. LLM-generated summaries of these topics were unambiguously negative:

| Theme | Summary |
|---|---|
| Equipment | Availability and maintenance of machines |
| Attitude | Disrespectful and unprofessional staff behaviour |
| Scheduling | Booking and cancellation issues for classes |
| Cancellation | Membership cancellation processes and unexpected billing |
| Membership | Access, enrollment, and cancellation experiences |
| Pricing | Membership fees and unexpected charges |
| Parking | Fines and unclear parking regulations |

6. LLM-Assisted Topic Extraction

A random sample of 1,000 negative reviews was passed to Falcon-7b-instruct with a prompt requesting the top 3 themes per review. BERTopic was then applied to the LLM-extracted themes rather than raw review text. This produced more semantically distinct topics and reduced unclassified reviews further, though at significantly higher computational cost.
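The sampling and prompting step might be sketched as follows. The prompt wording, the seed, the generation settings, and the output parser are all assumptions; only the model name and the sample size come from the description above.

```python
import random
import re

# Hypothetical prompt; the project's actual wording is not given in this README.
PROMPT_TEMPLATE = (
    "List the top 3 themes in the following gym review, one per line.\n"
    "Review: {review}\nThemes:"
)


def sample_reviews(reviews, k=1000, seed=42):
    """Draw a reproducible random sample of at most k reviews."""
    return random.Random(seed).sample(reviews, min(k, len(reviews)))


def build_prompt(review):
    return PROMPT_TEMPLATE.format(review=review)


def parse_themes(generated, max_themes=3):
    """Strip leading bullets or numbering and keep the first few non-empty lines."""
    lines = (
        re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        for line in generated.splitlines()
    )
    return [line for line in lines if line][:max_themes]


def make_falcon_generator(model_name="tiiuae/falcon-7b-instruct"):
    """Return a prompt -> completion function (lazy import; needs a large GPU)."""
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)

    def generate(prompt):
        output = generator(prompt, max_new_tokens=64, return_full_text=False)
        return output[0]["generated_text"]

    return generate
```

The parsed themes for each review would then be joined into a short string and passed to BERTopic in place of the raw review text.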

The LLM was also prompted to generate structured improvement suggestions per theme, producing a directly actionable output.

7. LDA with Gensim and pyLDAvis

LDA was applied as a comparative baseline, with lemmatisation applied before fitting. pyLDAvis was used to inspect topic separation and term relevance, with the relevance slider (λ) set to approximately 0.7 to balance a term's probability within a topic against its exclusivity to that topic. Topics were mostly well separated, with minor overlap between topics 1, 2, 4, 6, and 9.
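The pyLDAvis slider controls the relevance score λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), where p(w|t) is a term's probability within a topic and p(w) its overall probability. The sketch below computes that score and shows one plausible way to fit and visualise the model; `num_topics`, `passes`, and the helper names are illustrative assumptions rather than the notebook's settings.

```python
import math


def relevance(p_term_given_topic, p_term, lam=0.7):
    """Relevance of a term to a topic: weighted log-probability plus log-lift."""
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(
        p_term_given_topic / p_term
    )


def fit_lda(tokenised_docs, num_topics=10):
    """Fit a gensim LDA model on lists of (lemmatised) tokens (lazy import)."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenised_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,  # assumed value
        passes=10,              # assumed value
        random_state=42,
    )
    return lda, corpus, dictionary


def prepare_visualisation(lda, corpus, dictionary):
    """Build the pyLDAvis data structure for display in a notebook."""
    import pyLDAvis.gensim_models as gensimvis

    return gensimvis.prepare(lda, corpus, dictionary)
```

At λ = 1 terms are ranked purely by within-topic probability; lower values favour terms that are rare elsewhere in the corpus.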

LDA provides fast, lightweight topic exploration using a Bayesian bag-of-words approach, but lacks semantic understanding — making it a useful first pass rather than a primary analysis method.

Identified Issues and Recommendations

Seven key improvement areas were identified from the negative angry review analysis:

| Issue Area | Key Problems | Recommendations |
|---|---|---|
| Customer Service | Unhelpful staff, poor communication, inconsistent responses | Staff training, complaint tracking system |
| Cleanliness and Hygiene | Dirty changing rooms, bad odours, no soap or water stations | Increased cleaning frequency, restocking amenities |
| Equipment Availability | Broken, worn-out, or insufficient equipment | Investment in new equipment, better maintenance schedules |
| Membership Policies | Unexpected charges, difficult cancellation, poor communication | Clearer terms, flexible cancellation options |
| Parking and Access | Fines, unclear parking rules, access issues for members | Clearer signage, discounted day passes for off-peak visits |
| Class Availability and Quality | Limited evening classes, cancellations without notice | Expanded timetables, better cancellation communication |
| General Gym Experience | Overcrowding, loud music, poor atmosphere during peak hours | Regular member feedback, peak-hour management strategies |

Method Comparison

| Method | Strengths | Limitations |
|---|---|---|
| Word Frequency | Fast, simple, interpretable | No sentiment or context |
| LDA | Lightweight, fast, good for initial exploration | Bag-of-words, no semantic understanding |
| BERTopic | Semantic clustering, good topic separation | High uncategorised rate on small subsets |
| Emotion (BERT) | Useful for filtering, anger signal is reliable | Joy bias due to no "unclassified" label, misclassifies short text |
| LLM Themes | Best semantic quality, directly actionable summaries | High computational cost, limited to sampled subset |

Libraries

pandas
numpy
nltk
langdetect
wordcloud
matplotlib
seaborn
bertopic
transformers
gensim
pyLDAvis
sentence-transformers

Structure

├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.ipynb   # Full analysis notebook
├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.pdf     # Written report
└── README.md

About

A multi-method NLP pipeline combining word frequency analysis, BERTopic, BERT-based emotion classification, LDA, and LLM-assisted theme extraction.
