A topic project completed as part of the Cambridge Data Science Career Accelerator (C301, Weeks 4 & 5).
The project applies a multi-method NLP pipeline to two independent PureGym review datasets — Google Reviews and TrustPilot — to identify customer pain points and generate actionable improvement recommendations.
Despite strong brand performance, PureGym faces unresolved customer issues that risk damaging retention and revenue. Manually reviewing large volumes of customer feedback is impractical. Applying NLP can systematically surface the root causes of dissatisfaction, enabling targeted fixes that are more likely to improve customer satisfaction and reduce reliance on discounting as a retention strategy.
Two review datasets covering 12 months of PureGym feedback:
| Dataset | Unique Locations | Notes |
|---|---|---|
| Google Reviews | 512 | Scored 1–5 via Overall Score |
| TrustPilot | 377 | Scored 1–5 via Stars |
310 locations were common to both datasets. Negative reviews were defined as scores below 3 in both sources.
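The two definitions above (negative means a score below 3, and common locations are those present in both sources) can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the record layout, the `Site A` style location names, and the helper names are all hypothetical.

```python
# Hypothetical records as (location, score) pairs; real column names may differ.
google_reviews = [("Site A", 1), ("Site A", 5), ("Site B", 2), ("Site C", 4)]
trustpilot_reviews = [("Site A", 2), ("Site B", 3), ("Site D", 1)]

NEGATIVE_THRESHOLD = 3  # scores strictly below 3 count as negative


def negative_only(reviews):
    """Keep reviews whose score is strictly below the threshold."""
    return [(loc, score) for loc, score in reviews if score < NEGATIVE_THRESHOLD]


def common_locations(a, b):
    """Return the set of locations that appear in both datasets."""
    return {loc for loc, _ in a} & {loc for loc, _ in b}


google_negative = negative_only(google_reviews)
trustpilot_negative = negative_only(trustpilot_reviews)
shared = common_locations(google_reviews, trustpilot_reviews)
```

Note that a score of exactly 3 is treated as non-negative, matching the "below 3" definition.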
Language detection was performed using langdetect to obtain appropriate stopword lists. Non-English and unknown language reviews (including short text flagged as Welsh) were excluded from analysis.
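A minimal sketch of the language filter follows. It assumes the detector is passed in as a callable (for example `langdetect.detect`, which returns codes such as `"en"` or `"cy"`), and it treats any review that fails detection as unknown. The function name `keep_english` and the stub detector are hypothetical and exist only so the example runs without `langdetect` installed.

```python
def keep_english(reviews, detect):
    """Return only the reviews whose detected language code is "en".

    `detect` is any callable mapping text to a language code, such as
    langdetect.detect. Reviews that raise during detection (for example
    empty or symbol-only text) are treated as unknown and dropped.
    """
    kept = []
    for text in reviews:
        try:
            lang = detect(text)
        except Exception:  # langdetect raises LangDetectException in this case
            continue
        if lang == "en":
            kept.append(text)
    return kept


def stub_detect(text):
    """Stand-in detector for illustration only; not a real language model."""
    if not text.strip():
        raise ValueError("no features in text")
    return "cy" if text == "Da iawn" else "en"


sample = ["Great gym, friendly staff", "Da iawn", "   "]
english_only = keep_english(sample, stub_detect)
```

With the real library, `langdetect` results can vary between runs on short text unless `DetectorFactory.seed` is set, which may be relevant to the short reviews flagged as Welsh.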
The analysis was structured in progressive stages, from simple statistical exploration through to semantic modelling and LLM-assisted insight generation.
Text was lowercased, tokenised using NLTK's word_tokenize, and stopwords were removed. Frequency distributions and word clouds were generated for all reviews and negative reviews separately, across both datasets.
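The preprocessing and counting steps can be sketched as below. To keep the example dependency-free, a simple regular expression stands in for NLTK's `word_tokenize` and a tiny hand-written set stands in for the NLTK stopword list, so token boundaries will differ slightly from the real pipeline. The sample reviews are invented.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; the real list would come from NLTK.
STOPWORDS = {"the", "is", "and", "a", "was", "were", "but", "very", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(reviews):
    """Count non-stopword tokens across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(tok for tok in tokenize(text) if tok not in STOPWORDS)
    return counts


reviews = [
    "The equipment is great and the staff are friendly",
    "Equipment was broken but staff were friendly",
]
freq = word_frequencies(reviews)
top_terms = freq.most_common(3)
```

Running the same function on all reviews and on the negative subset gives the two frequency distributions that are compared in the next paragraph.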
Both datasets showed a broadly positive tone overall, with high-frequency terms such as "equipment", "great", "good", "clean", and "friendly". Isolating negative reviews shifted vocabulary toward "equipment", "staff", "machines", and "membership", providing initial signal but without directional context — word frequency alone cannot capture sentiment.
BERTopic was applied to the combined review corpus, first with default parameters and then with tuned hyperparameters. The default model produced a large number of small, fragmented topics and left a high proportion of reviews uncategorised (topic -1: 7,809). After tuning, the uncategorised count fell to 6,204 (a reduction of about 21%), topic counts grew, and the intertopic distance map showed clearer separation.
The top 10 topics from the tuned model were labelled using an LLM (Falcon-7b-instruct), producing the following themes:
| Theme | Summary |
|---|---|
| Equipment | Variety and availability of weights, machines, and benches |
| Facilities | Temperature and comfort of showers and changing rooms |
| Hygiene | Cleanliness and odour of toilets and changing rooms |
| Parking | Availability and convenience of free parking |
| Space | Ample equipment areas but crowding concerns |
| Climate | Air conditioning, saunas, temperature control |
| Membership | Joining processes, fees, codes, and account activation |
| Classes | Availability and quality of workout classes |
A similarity matrix revealed overlap between equipment-related topics, suggesting shared content across those clusters.
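As an illustration of how such a matrix exposes overlap, the sketch below computes pairwise cosine similarity between topic vectors. The choice of cosine similarity is an assumption (the metric is not stated here), and the three-dimensional vectors and topic names are invented purely for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Square matrix of pairwise cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical topic vectors; real topic embeddings have many more dimensions.
topic_vectors = {
    "equipment_a": [0.9, 0.1, 0.0],
    "equipment_b": [0.8, 0.2, 0.1],
    "parking": [0.0, 0.1, 0.9],
}
labels = list(topic_vectors)
matrix = similarity_matrix([topic_vectors[name] for name in labels])
```

Here the two equipment vectors score close to 1 against each other and close to 0 against parking, which is the pattern that indicates overlapping clusters.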
Locations were ranked by negative review count across both datasets and combined into a cross-source ranking. The top worst-performing locations by total negative review count were:
| Location | Google | TrustPilot | Total |
|---|---|---|---|
| London Stratford | 59 | 22 | 81 |
| London Enfield | 25 | 23 | 48 |
| London Swiss Cottage | 24 | 15 | 39 |
| London Seven Sisters | 18 | 16 | 34 |
| London Bermondsey | 16 | 18 | 34 |
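A cross-source ranking of this kind can be sketched with `collections.Counter`. The inputs are assumed to be one location name per negative review, the function name `rank_locations` is hypothetical, and the counts below are invented rather than taken from the table.

```python
from collections import Counter


def rank_locations(google_negative, trustpilot_negative, top_n=5):
    """Rank locations by combined negative review count.

    Each input is a list with one location name per negative review.
    Returns (location, google_count, trustpilot_count, total) tuples.
    """
    google_counts = Counter(google_negative)
    trustpilot_counts = Counter(trustpilot_negative)
    totals = google_counts + trustpilot_counts
    return [
        (loc, google_counts[loc], trustpilot_counts[loc], total)
        for loc, total in totals.most_common(top_n)
    ]


# Hypothetical data: one entry per negative review.
google = ["Site A"] * 3 + ["Site B"]
trustpilot = ["Site A"] + ["Site B"] * 2 + ["Site C"]
ranking = rank_locations(google, trustpilot)
```

Because this ranks on raw counts, busier locations rise to the top regardless of their negative rate.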
High London representation reflects higher overall review volume rather than necessarily worse performance. Word frequency analysis on the top 30 worst-performing locations showed that positive terms like "clean" and "friendly" disappeared compared to the full dataset, though vocabulary shifts were limited without sentiment context.
BERTopic applied to this location subset produced fewer but larger, better-separated topics with clearer LLM-generated summaries, including location-specific issues such as maintenance problems, parking fines, and staff complaints.
BERT-based emotion classification was applied to assign one of six emotions (anger, fear, joy, love, sadness, surprise) to each review. The model has no "unclassified" label, which introduces a joy bias — short or ambiguous negative reviews were frequently misclassified as joy. This was confirmed by manual inspection.
Despite this limitation, filtering to angry reviews produced cleaner and more consistent topic signals, with fewer uncategorised reviews and more clearly separated clusters in BERTopic. Anger and sadness dominated negative review emotion distributions, aligning with expectations.
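The filtering step can be sketched as follows, assuming each review already carries an emotion label from the classifier (the classifier call itself is not shown). The `Review` dataclass, the `angry_negative` helper, and the sample reviews are hypothetical; `max_score=2` encodes "below 3" for integer scores.

```python
from dataclasses import dataclass


@dataclass
class Review:
    text: str
    score: int    # 1 to 5
    emotion: str  # label already assigned by the emotion classifier


def angry_negative(reviews, max_score=2):
    """Keep reviews labelled as anger whose score is at or below max_score."""
    return [r for r in reviews if r.emotion == "anger" and r.score <= max_score]


reviews = [
    Review("Machines always broken", 1, "anger"),
    Review("ok", 2, "joy"),  # short text, the kind prone to a joy label
    Review("Annoying queue but fine overall", 4, "anger"),
    Review("Charged twice after cancelling", 2, "anger"),
]
selected = angry_negative(reviews)
```

Raising `max_score` to 5 keeps every angry review regardless of rating, which corresponds to filtering on emotion alone.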
Restricting the corpus to reviews classified as angry and paired with low scores produced the most actionable topic set, with roughly 40 topics and strong cluster separation. LLM-generated summaries of these topics were unambiguously negative:
| Theme | Summary |
|---|---|
| Equipment | Availability and maintenance of machines |
| Attitude | Disrespectful and unprofessional staff behaviour |
| Scheduling | Booking and cancellation issues for classes |
| Cancellation | Membership cancellation processes and unexpected billing |
| Membership | Access, enrollment, and cancellation experiences |
| Pricing | Membership fees and unexpected charges |
| Parking | Fines and unclear parking regulations |
A random sample of 1,000 negative reviews was passed to Falcon-7b-instruct with a prompt requesting the top 3 themes per review. BERTopic was then applied to the LLM-extracted themes rather than raw review text. This produced more semantically distinct topics and reduced unclassified reviews further, though at significantly higher computational cost.
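The sampling, prompting, and parsing steps might look like the sketch below. The prompt wording, the function names, and the assumption that the model answers with a numbered or bulleted list are all hypothetical; the actual prompt and the Falcon-7b-instruct call are not reproduced here.

```python
import random
import re


def sample_reviews(reviews, k=1000, seed=0):
    """Draw a reproducible random sample of at most k reviews."""
    rng = random.Random(seed)
    return rng.sample(reviews, min(k, len(reviews)))


def build_prompt(review, n_themes=3):
    """Build an instruction prompt asking for the top themes in one review."""
    return (
        f"List the top {n_themes} themes in the following gym review "
        "as a numbered list, one short phrase per line.\n\n"
        f"Review: {review}\n\nThemes:"
    )


def parse_themes(response, n_themes=3):
    """Extract up to n_themes items from a numbered or bulleted response."""
    themes = []
    for line in response.splitlines():
        match = re.match(r"\s*(?:\d+[.)]|[-*])\s*(.+)", line)
        if match:
            themes.append(match.group(1).strip())
    return themes[:n_themes]


# Invented model output, used only to exercise the parser.
example_response = "1. Broken equipment\n2) Rude staff\n- Parking fines\n4. Extra"
themes = parse_themes(example_response)
```

The parsed theme strings, rather than the raw review text, would then be the documents passed to BERTopic.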
The LLM was also prompted to generate structured improvement suggestions per theme, producing a directly actionable output.
LDA was applied as a comparative baseline, with lemmatisation applied before fitting. pyLDAvis was used to inspect topic separation and term relevance, with the relevance slider (λ) set to approximately 0.7 to balance term commonality and exclusivity. Topics were mostly well separated, with minor overlap between topics 1, 2, 4, 6, and 9.
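The effect of the slider can be seen from the relevance formula used by pyLDAvis, which weights a term's probability within a topic against its lift over the corpus-wide probability. The sketch below implements that formula directly; the probabilities for the two example terms are invented.

```python
import math


def relevance(p_term_given_topic, p_term, lam=0.7):
    """Relevance of a term to a topic for a given slider value lam.

    lam = 1 ranks purely by within-topic probability (common terms win);
    lam = 0 ranks purely by lift, p(term | topic) / p(term) (exclusive terms win).
    """
    return lam * math.log(p_term_given_topic) + (1 - lam) * math.log(
        p_term_given_topic / p_term
    )


# Hypothetical probabilities: "gym" is frequent everywhere,
# "treadmill" is rarer overall but concentrated in this topic.
gym = {"p_topic": 0.10, "p_corpus": 0.08}
treadmill = {"p_topic": 0.04, "p_corpus": 0.005}

scores_at_default = {
    "gym": relevance(gym["p_topic"], gym["p_corpus"]),
    "treadmill": relevance(treadmill["p_topic"], treadmill["p_corpus"]),
}
```

Moving the slider towards 0 favours topic-specific terms such as the hypothetical "treadmill", while moving it towards 1 favours terms that are simply frequent.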
LDA provides fast, lightweight topic exploration using a Bayesian bag-of-words approach, but lacks semantic understanding — making it a useful first pass rather than a primary analysis method.
Seven key improvement areas were identified from the analysis of angry, negative reviews:
| Issue Area | Key Problems | Recommendations |
|---|---|---|
| Customer Service | Unhelpful staff, poor communication, inconsistent responses | Staff training, complaint tracking system |
| Cleanliness and Hygiene | Dirty changing rooms, bad odours, no soap or water stations | Increased cleaning frequency, restocking amenities |
| Equipment Availability | Broken, worn-out, or insufficient equipment | Investment in new equipment, better maintenance schedules |
| Membership Policies | Unexpected charges, difficult cancellation, poor communication | Clearer terms, flexible cancellation options |
| Parking and Access | Fines, unclear parking rules, access issues for members | Clearer signage, discounted day passes for off-peak visits |
| Class Availability and Quality | Limited evening classes, cancellations without notice | Expanded timetables, better cancellation communication |
| General Gym Experience | Overcrowding, loud music, poor atmosphere during peak hours | Regular member feedback, peak-hour management strategies |
| Method | Strengths | Limitations |
|---|---|---|
| Word Frequency | Fast, simple, interpretable | No sentiment or context |
| LDA | Lightweight, fast, good for initial exploration | Bag-of-words, no semantic understanding |
| BERTopic | Semantic clustering, good topic separation | High uncategorised rate on small subsets |
| Emotion (BERT) | Useful for filtering, anger signal is reliable | Joy bias due to no "unclassified" label, misclassifies short text |
| LLM Themes | Best semantic quality, directly actionable summaries | High computational cost, limited to sampled subset |
pandas
numpy
nltk
langdetect
wordcloud
matplotlib
seaborn
bertopic
transformers
gensim
pyLDAvis
sentence-transformers
├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.ipynb # Full analysis notebook
├── Ahearne_David_CAM_C301_Week_4and5_Topic_project.pdf # Written report
└── README.md