Applying NLP for Topic Modelling in a Real-Life Context

A topic project completed as part of the Cambridge Data Science Career Accelerator (C301, Weeks 4 & 5).

The project applies a multi-method NLP pipeline to two independent PureGym review datasets — Google Reviews and TrustPilot — to identify customer pain points and generate actionable improvement recommendations.

Problem Statement

Despite strong brand performance, PureGym faces unresolved customer issues that risk damaging retention and revenue. Manually reviewing large volumes of customer feedback is impractical. Applying NLP can systematically surface the root causes of dissatisfaction, enabling targeted fixes that are more likely to improve customer satisfaction and reduce reliance on discounting as a retention strategy.

Datasets

Two review datasets covering 12 months of PureGym feedback:

Dataset	Unique Locations	Notes
Google	512	Scored 1–5 via Overall Score
TrustPilot	377	Scored 1–5 via Stars

310 locations were common to both datasets. Negative reviews were defined as scores below 3 in both sources.
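The negative-review rule and the location overlap can be sketched with pandas as follows. The score column names `Overall Score` and `Stars` come from the table above; the `location` column name and the toy rows are assumptions for illustration only, not the notebook's actual schema or data.

```python
import pandas as pd

NEGATIVE_THRESHOLD = 3  # scores strictly below 3 count as negative


def flag_negative(df: pd.DataFrame, score_col: str) -> pd.DataFrame:
    """Return a copy of df with a boolean `is_negative` column."""
    out = df.copy()
    out["is_negative"] = out[score_col] < NEGATIVE_THRESHOLD
    return out


def common_locations(a: pd.DataFrame, b: pd.DataFrame, col: str = "location") -> set:
    """Locations that appear in both datasets (column name is an assumption)."""
    return set(a[col].dropna()) & set(b[col].dropna())


# Toy rows, purely illustrative.
google = pd.DataFrame(
    {"location": ["Site A", "Site B", "Site C"], "Overall Score": [1, 3, 5]}
)
trustpilot = pd.DataFrame(
    {"location": ["Site A", "Site C", "Site D"], "Stars": [2, 4, 1]}
)

google = flag_negative(google, "Overall Score")
trustpilot = flag_negative(trustpilot, "Stars")
shared = common_locations(google, trustpilot)
```

Note that a score of exactly 3 is not treated as negative under this rule.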

Language detection was performed with langdetect so that the appropriate stopword list could be applied. Reviews detected as non-English or of unknown language (including short texts flagged as Welsh) were excluded from the analysis.
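A minimal sketch of this filtering step is shown below. The helper names `detect_language` and `keep_english` are hypothetical, and mapping detection failures to `"unknown"` is an assumption about how unknown-language reviews were handled. The detector is passed in as a parameter so the filter can be tried without langdetect installed.

```python
from typing import Callable, Iterable, List


def detect_language(text: str) -> str:
    """Return langdetect's language code, or "unknown" if detection fails."""
    # Imported lazily so keep_english() can be used with any detector.
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed
    try:
        return detect(text)
    except LangDetectException:
        # Raised for empty text or text with no usable features.
        return "unknown"


def keep_english(
    reviews: Iterable[str], detector: Callable[[str], str] = detect_language
) -> List[str]:
    """Keep only reviews whose detected language code is "en"."""
    return [review for review in reviews if detector(review) == "en"]


# Stand-in detector for illustration; real use would rely on detect_language.
def stub_detector(text: str) -> str:
    return "cy" if text == "da iawn" else "en"


kept = keep_english(["great gym", "da iawn"], detector=stub_detector)
```

Very short reviews give a language detector little to work with, which is consistent with the short texts flagged as Welsh noted above.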

Methods

The analysis was structured in progressive stages, from simple statistical exploration through to semantic modelling and LLM-assisted insight generation.

1. Basic Text Analysis

Text was lowercased, tokenised using NLTK's word_tokenize, and stopwords were removed. Frequency distributions and word clouds were generated for all reviews and negative reviews separately, across both datasets.
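The preprocessing and frequency count described above could look roughly like this. `top_terms` is a hypothetical helper, and dropping non-alphabetic tokens is an assumption that the source does not state. The tokeniser is a parameter so the counting logic can be tried without NLTK's tokenizer data.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple


def nltk_tokenize(text: str) -> List[str]:
    """Tokenise with NLTK's word_tokenize (needs NLTK's tokenizer data)."""
    from nltk.tokenize import word_tokenize

    return word_tokenize(text)


def top_terms(
    texts: Iterable[str],
    stopwords: Set[str],
    n: int = 10,
    tokenize: Callable[[str], List[str]] = nltk_tokenize,
) -> List[Tuple[str, int]]:
    """Lowercase, tokenise, drop stopwords, and return the n most common terms."""
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text.lower())
        # Assumption: punctuation and numeric tokens are discarded.
        counts.update(t for t in tokens if t.isalpha() and t not in stopwords)
    return counts.most_common(n)


# Toy example using whitespace splitting instead of NLTK.
demo = top_terms(
    ["The equipment is great", "Great staff, broken equipment"],
    stopwords={"the", "is"},
    n=2,
    tokenize=str.split,
)
```

With NLTK installed, the stopword set would typically come from `nltk.corpus.stopwords.words("english")`, and the resulting counts can be passed to a word cloud or plotted directly.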

Both datasets showed a broadly positive tone overall, with high-frequency terms such as "equipment", "great", "good", "clean", and "friendly". Isolating negative reviews shifted vocabulary toward "equipment", "staff", "machines", and "membership", providing initial signal but without directional context — word frequency alone cannot capture sentiment.

2. BERTopic — Full Dataset

BERTopic was applied to the combined review corpus, first with default parameters and then with hyperparameter tuning. The default model produced a large number of small, fragmented topics and left a high proportion of reviews uncategorised (topic -1: 7,809). After tuning, the uncategorised count fell to 6,204, topic counts grew, and the intertopic distance map showed clearer separation.
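A sketch of how two BERTopic runs might be compared on their outlier counts is given below. The source does not say which hyperparameters were tuned, so `min_topic_size` is used here only as one example of a real BERTopic setting; `fit_bertopic` and `outlier_summary` are hypothetical helper names.

```python
from collections import Counter
from typing import Dict, List, Sequence


def fit_bertopic(docs: List[str], min_topic_size: int = 10):
    """Fit BERTopic and return (model, topics). Imported lazily: heavy dependency."""
    from bertopic import BERTopic

    # Assumption: min_topic_size is the knob being varied; the notebook may differ.
    model = BERTopic(min_topic_size=min_topic_size)
    topics, _probs = model.fit_transform(docs)
    return model, topics


def outlier_summary(topics: Sequence[int]) -> Dict[str, float]:
    """Summarise a topic assignment; BERTopic uses -1 for uncategorised documents."""
    counts = Counter(topics)
    total = len(topics)
    n_outliers = counts.get(-1, 0)
    return {
        "n_topics": len([t for t in counts if t != -1]),
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / total if total else 0.0,
    }


# Toy assignment: two uncategorised documents and three real topics.
summary = outlier_summary([-1, 0, 0, 1, -1, 2])
```

Running `outlier_summary` on the topic lists from the default and tuned models gives a like-for-like view of the change in the topic -1 count.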

The top 10 topics from the tuned model were labelled using an LLM (Falcon-7b-instruct), producing the following themes:

Theme	Summary
Equipment	Variety and availability of weights, machines, and benches
Facilities	Temperature and comfort of showers and changing rooms
Hygiene	Cleanliness and odour of toilets and changing rooms
Parking	Availability and convenience of free parking
Space	Ample equipment areas but crowding concerns
Climate	Air conditioning, saunas, temperature control
Membership	Joining processes, fees, codes, and account activation
Classes	Availability and quality of workout classes

A similarity matrix revealed overlap between equipment-related topics, suggesting shared content across those clusters.
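One way to compute such a matrix is cosine similarity between per-topic embedding vectors, sketched below with NumPy. The 0.8 threshold and the toy vectors are assumptions for illustration; with a fitted BERTopic model the input would likely be its `topic_embeddings_` attribute, and `visualize_heatmap()` offers a built-in plot of the same idea.

```python
import numpy as np


def cosine_similarity_matrix(embeddings) -> np.ndarray:
    """Pairwise cosine similarity between row vectors."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def overlapping_pairs(sim: np.ndarray, threshold: float = 0.8):
    """List (i, j, similarity) for distinct topic pairs at or above the threshold."""
    n = sim.shape[0]
    return [
        (i, j, float(sim[i, j]))
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i, j] >= threshold
    ]


# Toy embeddings: rows 0 and 1 point in nearly the same direction.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = cosine_similarity_matrix(toy_embeddings)
pairs = overlapping_pairs(sim, threshold=0.8)
```

Pairs returned by `overlapping_pairs` are candidates for merging or for a shared label, as with the equipment-related topics.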

3. Location-Specific Analysis

Locations were ranked by negative review count in each dataset, and the two rankings were combined into a cross-source ranking. The five worst-performing locations by total negative review count were:

Location	Google	TrustPilot	Total
London Stratford	59	22	81
London Enfield	25	23	48
London Swiss Cottage	24	15	39
London Seven Sisters	18	16	34
London Bermondsey	16	18	34
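The cross-source ranking above can be approximated with the following pandas sketch. The `location` column name, the helper name `rank_locations`, and the choice to keep locations that appear in only one source (filled with zero) are assumptions; the toy rows are not real review data.

```python
import pandas as pd


def rank_locations(
    google_neg: pd.DataFrame, trustpilot_neg: pd.DataFrame, col: str = "location"
) -> pd.DataFrame:
    """Count negative reviews per location in each source and rank by the total."""
    g = google_neg[col].value_counts().rename("Google")
    t = trustpilot_neg[col].value_counts().rename("TrustPilot")
    # Outer alignment on location; a missing source contributes zero reviews.
    combined = pd.concat([g, t], axis=1).fillna(0).astype(int)
    combined["Total"] = combined["Google"] + combined["TrustPilot"]
    return combined.sort_values("Total", ascending=False)


# Toy negative-review frames, one row per review.
google_neg = pd.DataFrame({"location": ["Site A", "Site A", "Site B"]})
trustpilot_neg = pd.DataFrame({"location": ["Site A", "Site C"]})
ranked = rank_locations(google_neg, trustpilot_neg)
```

Ranking on raw counts favours high-traffic sites; dividing by each location's total review count would give a rate-based alternative.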

The high share of London locations may reflect higher overall review volume rather than worse performance. Word frequency analysis on the top 30 worst-performing locations showed that positive terms like "clean" and "friendly" disappeared compared to the full dataset, though vocabulary shifts were limited without sentiment context.

BERTopic applied to this location subset produced fewer but larger, better-separated topics with clearer LLM-generated summaries, including location-specific issues such as maintenance problems, parking fines, and staff complaints.

4. Emotion Analysis

BERT-based emotion classification was applied to assign one of six emotions (anger, fear, joy, love, sadness, surprise) to each review. The model has no "unclassified" label, which introduces a joy bias — short or ambiguous negative reviews were frequently misclassified as joy. This was confirmed by manual inspection.

Despite this limitation, filtering to angry reviews produced cleaner and more consistent topic signals, with fewer uncategorised reviews and more clearly separated clusters in BERTopic. Anger and sadness dominated negative review emotion distributions, aligning with expectations.
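The classification and anger filter might be sketched as below. The source does not name the emotion model, so the Hugging Face model id is an assumption (chosen because it is a BERT model with the same six labels), and the `text` and `score` keys are hypothetical. The classifier is passed in as a parameter so the filtering logic can be tried with a stand-in.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: the notebook's model is not named in this README.
MODEL_NAME = "bhadresh-savani/bert-base-uncased-emotion"


def load_emotion_classifier(model_name: str = MODEL_NAME):
    """Build a transformers text-classification pipeline (downloads the model)."""
    from transformers import pipeline

    return pipeline("text-classification", model=model_name)


def label_emotions(texts: Sequence[str], classifier: Callable) -> List[str]:
    """Return the top emotion label for each text."""
    results = classifier(list(texts), truncation=True)
    return [result["label"] for result in results]


def angry_negative(
    reviews: Sequence[Dict], classifier: Callable, threshold: int = 3
) -> List[Dict]:
    """Keep reviews labelled "anger" whose score is below the negative threshold."""
    labels = label_emotions([r["text"] for r in reviews], classifier)
    return [
        r for r, label in zip(reviews, labels)
        if label == "anger" and r["score"] < threshold
    ]


# Stand-in classifier for illustration; it mimics the pipeline's output shape.
def stub_classifier(texts, truncation=True):
    return [
        {"label": "anger" if "refund" in t else "joy", "score": 0.9} for t in texts
    ]


reviews = [
    {"text": "still waiting for my refund", "score": 1},
    {"text": "ok", "score": 2},
    {"text": "refund arrived quickly", "score": 5},
]
angry = angry_negative(reviews, stub_classifier)
```

Because every review receives one of the six labels, pairing the anger label with a low score is what keeps misclassified short reviews out of the subset.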

5. BERTopic on Negative Angry Reviews

Restricting the corpus to reviews classified as angry and paired with low scores produced the most actionable topic set, with roughly 40 topics and strong cluster separation. LLM-generated summaries of these topics were unambiguously negative:

Theme	Summary
Equipment	Availability and maintenance of machines
Attitude	Disrespectful and unprofessional staff behaviour
Scheduling	Booking and cancellation issues for classes
Cancellation	Membership cancellation processes and unexpected billing
Membership	Access, enrolment, and cancellation experiences
Pricing	Membership fees and unexpected charges
Parking	Fines and unclear parking regulations

6. LLM-Assisted Topic Extraction

A random sample of 1,000 negative reviews was passed to Falcon-7b-instruct with a prompt requesting the top 3 themes per review. BERTopic was then applied to the LLM-extracted themes rather than raw review text. This produced more semantically distinct topics and reduced unclassified reviews further, though at significantly higher computational cost.
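The sampling, prompting, and parsing steps could be sketched as follows. The prompt wording, the helper names, and the numbered-list parsing rule are all assumptions, since the notebook's actual prompt is not reproduced here. The `generate` argument is any callable that maps a prompt string to generated text.

```python
import random
import re
from typing import Callable, List, Sequence

# Hypothetical prompt; the notebook's wording may differ.
PROMPT_TEMPLATE = (
    "List the top 3 themes in the following gym review as a numbered list.\n"
    "Review: {review}\n"
    "Themes:"
)


def build_prompt(review: str) -> str:
    """Fill the prompt template with a single review."""
    return PROMPT_TEMPLATE.format(review=review.strip())


def sample_reviews(reviews: Sequence[str], k: int = 1000, seed: int = 42) -> List[str]:
    """Draw a reproducible random sample of at most k reviews."""
    rng = random.Random(seed)
    return rng.sample(list(reviews), min(k, len(reviews)))


def parse_themes(generated: str, max_themes: int = 3) -> List[str]:
    """Extract numbered or bulleted lines from the model output."""
    themes = []
    for line in generated.splitlines():
        match = re.match(r"^\s*(?:\d+[.)]|[-*])\s*(.+?)\s*$", line)
        if match:
            themes.append(match.group(1))
    return themes[:max_themes]


def extract_themes(reviews: Sequence[str], generate: Callable[[str], str]) -> List[List[str]]:
    """Run the LLM over each review and parse the themes it returns."""
    return [parse_themes(generate(build_prompt(review))) for review in reviews]


# Stand-in generator for illustration only.
def stub_generate(prompt: str) -> str:
    return "1. Broken machines\n2) Rude staff\n- Parking fines\n4. Extra item"


themes = extract_themes(sample_reviews(["review one", "review two"], k=1), stub_generate)
```

With transformers, `generate` would typically wrap a `text-generation` pipeline loaded from the `tiiuae/falcon-7b-instruct` checkpoint. The parsed themes, joined per review, then become the documents passed to BERTopic.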

The LLM was also prompted to generate structured improvement suggestions per theme, producing a directly actionable output.

7. LDA with Gensim and pyLDAvis

LDA was applied as a comparative baseline, with lemmatisation applied before fitting. pyLDAvis was used to inspect topic separation and term relevance, with the relevance slider (λ) set to approximately 0.7 to balance term commonality and exclusivity. Topics showed mostly strong separation, with minor overlap between topics 1, 2, 4, 6, and 9.
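The sketch below shows a minimal Gensim LDA fit alongside the relevance score that the pyLDAvis slider controls: relevance = λ · log p(term | topic) + (1 − λ) · log(p(term | topic) / p(term)). The `num_topics`, `passes`, and `seed` values are assumptions, as the source does not state the settings used.

```python
import math
from typing import List


def relevance(p_term_given_topic: float, p_term: float, lam: float = 0.7) -> float:
    """Relevance of a term to a topic, as used to rank terms in pyLDAvis.

    lam = 1 ranks purely by within-topic probability (common terms);
    lam = 0 ranks purely by lift (terms exclusive to the topic).
    """
    lift = p_term_given_topic / p_term
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(lift)


def fit_lda(tokenised_docs: List[List[str]], num_topics: int = 10, seed: int = 0):
    """Fit a Gensim LDA model on lemmatised, tokenised documents."""
    # Imported lazily: gensim is only needed when a model is actually fitted.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenised_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenised_docs]
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,  # assumption: not stated in the source
        passes=5,               # assumption
        random_state=seed,
    )
    return lda, corpus, dictionary


# Two terms equally likely within a topic; the second is far more common overall.
exclusive_term = relevance(0.1, 0.1)
common_term = relevance(0.1, 0.5)
```

The fitted objects can then be passed to `pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)` for the interactive view. At λ ≈ 0.7, a term that is frequent everywhere is ranked below an equally probable term that is specific to the topic.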

LDA provides fast, lightweight topic exploration using a Bayesian bag-of-words approach, but lacks semantic understanding — making it a useful first pass rather than a primary analysis method.

Identified Issues and Recommendations

Seven key improvement areas were identified from the negative angry review analysis:

Issue Area	Key Problems	Recommendations
Customer Service	Unhelpful staff, poor communication, inconsistent responses	Staff training, complaint tracking system
Cleanliness and Hygiene	Dirty changing rooms, bad odours, no soap or water stations	Increased cleaning frequency, restocking amenities
Equipment Availability	Broken, worn-out, or insufficient equipment	Investment in new equipment, better maintenance schedules
Membership Policies	Unexpected charges, difficult cancellation, poor communication	Clearer terms, flexible cancellation options
Parking and Access	Fines, unclear parking rules, access issues for members	Clearer signage, discounted day passes for off-peak visits
Class Availability and Quality	Limited evening classes, cancellations without notice	Expanded timetables, better cancellation communication
General Gym Experience	Overcrowding, loud music, poor atmosphere during peak hours	Regular member feedback, peak-hour management strategies

Method Comparison

Method	Strengths	Limitations
Word Frequency	Fast, simple, interpretable	No sentiment or context
LDA	Lightweight, fast, good for initial exploration	Bag-of-words, no semantic understanding
BERTopic	Semantic clustering, good topic separation	High uncategorised rate on small subsets
Emotion (BERT)	Useful for filtering, anger signal is reliable	Joy bias due to no "unclassified" label, misclassifies short text
LLM Themes	Best semantic quality, directly actionable summaries	High computational cost, limited to sampled subset

Libraries

pandas
numpy
nltk
langdetect
wordcloud
matplotlib
seaborn
bertopic
transformers
gensim
pyLDAvis
sentence-transformers

Structure

├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.ipynb   # Full analysis notebook
├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.pdf     # Written report
└── README.md
