# LAB 4: Topic modeling

Use topic models to explore hotel reviews

Objectives:

- tokenize with multi-word expressions (MWEs) using spaCy
- estimate LDA topic models with tomotopy
- visualize and evaluate topic models
- apply topic models to interpretation of hotel reviews

In [27]:
from collections import Counter

import numpy as np
import pandas as pd
import tomotopy as tp
from cytoolz import concat, first
from tqdm.auto import tqdm

tqdm.pandas()

## Prepare data

In [28]:
df = pd.read_pickle("/data/hotels_id.pkl")
# Load the trained LDA model and the topic labels
mdl = tp.LDAModel.load("hotel-topics.bin")
labels = list(pd.read_csv("labels.csv")["label"])

In [29]:
# Hotels with the most 1-star reviews
df[df["overall"] == 1]["offering_id"].value_counts().head(20)

214197     1359
93421       684
223023      486
93520       329
111418      238
112066      233
99766       206
93437       195
99307       179
119728      169
93618       168
80602       157
1938661     156
122007      147
93450       145
93466       145
93464       144
101653      143
93356       139
87595       132
Name: offering_id, dtype: int64

Pick a hotel with a lot of 1-star ratings (other than #93520) and pull out all of its reviews.

In [30]:
hotel = df.query("offering_id==93520").copy()  # hotel #93520
hotel["overall"].value_counts()

4.0    826
5.0    575
3.0    448
1.0    329
2.0    313
Name: overall, dtype: int64

In [31]:
from tokenizer import MWETokenizer

tokenizer = MWETokenizer(open("terms.txt"))

In [32]:
# Tokenize the title and text of each review together
hotel["tokens"] = (hotel["title"] + " " + hotel["text"]).progress_apply(
    tokenizer.tokenize
)

  0%|          | 0/2491 [00:00<?, ?it/s]

In [34]:
hotel["tokens"].head()

49680    [bedbugs, bedbugs, no, acknowledgement, no, bi...
49681    [thank, goodness, for, joe, i, stayed, in, thi...
49705    [perfect_location, great_staff, good_room, rig...
49706    [excellent_location, and, great_service, defin...
49707    [confusion, a, so, so, hotel, in, the, heart, ...
Name: tokens, dtype: object

## Apply topic model

In [11]:
hotel["doc"] = [mdl.make_doc(words=toks) for toks in hotel["tokens"]]
topic_dist, ll = mdl.infer(hotel["doc"])

## Interpret model

What topics are associated with a review?

In [12]:
hotel["text"].iloc[0]

'Bedbugs!!!! No acknowledgement, no bill adjustment, just fill out a form for Security. I showed the manager a bite, and I am still itching like crazy! Only where my body was in contact with the bed did I have bites.'

In [13]:
# Topic 40 accounts for about 22% of this review
hotel["doc"].iloc[0].get_topics(top_n=5)

[(40, 0.21920356154441833),
 (42, 0.10160670429468155),
 (21, 0.06386015564203262),
 (26, 0.06159566715359688),
 (48, 0.05676766484975815)]

In [14]:
mdl.get_topic_words(40)

[('front_desk', 0.03532785177230835),
 ('told', 0.03505776450037956),
 ('called', 0.031060485169291496),
 ('asked', 0.021985579282045364),
 ('call', 0.020419077947735786),
 ('never', 0.02036506123840809),
 ('them', 0.0200949739664793),
 ('said', 0.01755616068840027),
 ('down', 0.013180759735405445),
 ('another', 0.013126742094755173)]

In [16]:
mdl.get_topic_words(42)

[('am', 0.023971909657120705),
 ('them', 0.021705320104956627),
 ('people', 0.01980828121304512),
 ('their', 0.019266270101070404),
 ('because', 0.018108338117599487),
 ('know', 0.01611275225877762),
 ('sure', 0.015866383910179138),
 ('way', 0.015373646281659603),
 ('say', 0.01502873096615076),
 ('think', 0.014757725410163403)]

In [18]:
# Replace topic numbers with their labels
[(labels[x], y) for x, y in hotel["doc"].iloc[0].get_topics(top_n=5)]

[('FRONT_DESK', 0.21920356154441833),
 ('AM', 0.10160670429468155),
 ('NOISE', 0.06386015564203262),
 ('REVIEWS', 0.06159566715359688),
 ('ELEVATOR', 0.05676766484975815)]

What are the most common topics?

In [19]:
hotel["topics"] = [
    [labels[t] for t in map(first, d.get_topics(3))] for d in hotel["doc"]
]

In [21]:
hotel["topics"]
# Each review is mapped to the labels of its top 3 topics

49680     [FRONT_DESK, AM, NOISE]
49681             [AM, THEIR, HE]
49705        [THEIR, NYC, SHOWER]
49706     [RECOMMEND, THEIR, NYC]
49707            [NYC, CHECK, 'D]
                   ...           
119799           [3, NYC, STREET]
120641       [CHECK, BOOKED, ANY]
122697           [PRICE, AM, BIT]
123934             [BED, AM, BIT]
128080            [FREE, AM, BIT]
Name: topics, Length: 2491, dtype: object

In [22]:
topic_freq = Counter(concat(hotel["topics"]))
topic_freq.most_common()

[('AM', 1008),
 ('NYC', 691),
 ('FRONT_DESK', 497),
 ('CHECK', 456),
 ('BIT', 455),
 ('RECOMMEND', 399),
 ('ALWAYS', 315),
 ('WITHIN', 241),
 ('PRICE', 220),
 ('AROUND', 202),
 ('DIRTY', 202),
 ('THEIR', 194),
 ('3', 176),
 ('ANY', 166),
 ('SHOWER', 157),
 ('THEN', 152),
 ('REVIEWS', 148),
 ('BOOKED', 145),
 ("'D", 135),
 ('NOISE', 118),
 ('ELEVATOR', 117),
 ('STREET', 104),
 ('FOUND', 100),
 ('HE', 98),
 ('BED', 96),
 ('LOVED', 95),
 ('COLD', 90),
 ('ITS', 83),
 ('RESTAURANT', 70),
 ('FREE', 69),
 ('SHE', 68),
 ('MINUTES', 56),
 ('COFFEE', 53),
 ('LOCATED', 52),
 ('NICELY', 47),
 ('BEST', 42),
 ('LOBBY', 33),
 ('AIRPORT', 28),
 ('PARKING', 22),
 ('HILTON', 17),
 ('WINE', 13),
 ('DC', 10),
 ('SUITE', 8),
 ('KIDS', 8),
 ('SAN_FRANCISCO', 7),
 ('POOL', 5),
 ('VIEW', 3),
 ('CONFERENCE', 2)]

What are the most common topics in 1-star reviews?

In [24]:
topic_freq = Counter(concat(hotel.query("overall==1")["topics"]))
topic_freq.most_common()
# AM is the most frequent topic in 1-star reviews

[('AM', 180),
 ('FRONT_DESK', 151),
 ('DIRTY', 99),
 ('CHECK', 77),
 ('SHOWER', 41),
 ('BOOKED', 37),
 ('HE', 33),
 ('PRICE', 31),
 ('THEN', 30),
 ('ANY', 29),
 ('ALWAYS', 29),
 ('3', 21),
 ('AROUND', 19),
 ('THEIR', 19),
 ('NOISE', 18),
 ('ELEVATOR', 16),
 ('BED', 16),
 ('COLD', 15),
 ('BIT', 14),
 ('NYC', 14),
 ('REVIEWS', 13),
 ('RECOMMEND', 11),
 ('SHE', 10),
 ('FREE', 8),
 ("'D", 6),
 ('FOUND', 5),
 ('LOBBY', 5),
 ('WITHIN', 5),
 ('PARKING', 5),
 ('ITS', 4),
 ('BEST', 4),
 ('RESTAURANT', 3),
 ('COFFEE', 3),
 ('LOVED', 3),
 ('VIEW', 2),
 ('MINUTES', 2),
 ('HILTON', 2),
 ('AIRPORT', 2),
 ('POOL', 1),
 ('SUITE', 1),
 ('STREET', 1),
 ('LOCATED', 1),
 ('DC', 1)]

What are the most common topics in 5-star reviews?

In [26]:
topic_freq = Counter(concat(hotel.query("overall==5")["topics"]))
topic_freq.most_common()  # NYC is the most frequent topic in 5-star reviews

[('NYC', 224),
 ('RECOMMEND', 200),
 ('AM', 188),
 ('ALWAYS', 136),
 ('CHECK', 83),
 ('THEIR', 80),
 ('WITHIN', 74),
 ('BIT', 62),
 ('AROUND', 57),
 ('LOVED', 49),
 ('PRICE', 39),
 ('FRONT_DESK', 38),
 ('REVIEWS', 36),
 ('3', 31),
 ('BOOKED', 28),
 ('STREET', 28),
 ('SHE', 25),
 ('BEST', 24),
 ('ANY', 22),
 ('RESTAURANT', 22),
 ('LOCATED', 21),
 ('FOUND', 19),
 ('THEN', 19),
 ('MINUTES', 18),
 ("'D", 18),
 ('BED', 18),
 ('NOISE', 17),
 ('ITS', 17),
 ('COFFEE', 16),
 ('SHOWER', 15),
 ('HE', 13),
 ('NICELY', 13),
 ('ELEVATOR', 12),
 ('LOBBY', 10),
 ('FREE', 9),
 ('AIRPORT', 8),
 ('SUITE', 7),
 ('HILTON', 6),
 ('COLD', 5),
 ('PARKING', 4),
 ('WINE', 4),
 ('KIDS', 3),
 ('CONFERENCE', 2),
 ('DIRTY', 2),
 ('DC', 1),
 ('SAN_FRANCISCO', 1),
 ('VIEW', 1)]
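
The raw counts above are hard to compare across ratings because the groups differ in size: 329 one-star reviews versus 575 five-star reviews, per the `value_counts()` output earlier. The following is a minimal sketch of one way to normalize them. It assumes each review contributes a given label at most once, so that a count divided by the group size is the share of reviews with that topic among their top three. The helper name `topic_share` and the choice of five topics are illustrative only; the counts are copied from the outputs above.

```python
# Group sizes and topic counts copied from the outputs above (a few topics only).
n_reviews = {1: 329, 5: 575}
counts = {
    1: {"AM": 180, "FRONT_DESK": 151, "DIRTY": 99, "NYC": 14, "RECOMMEND": 11},
    5: {"AM": 188, "FRONT_DESK": 38, "DIRTY": 2, "NYC": 224, "RECOMMEND": 200},
}


def topic_share(rating, topic):
    """Share of reviews with this rating that have `topic` among their top three."""
    return counts[rating].get(topic, 0) / n_reviews[rating]


# Gap between the 1-star share and the 5-star share for each topic.
gaps = {t: topic_share(1, t) - topic_share(5, t) for t in counts[1]}

# Print topics from "most over-represented in 1-star reviews" downward.
for topic, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{topic:<12}{topic_share(1, topic):6.1%}{topic_share(5, topic):8.1%}{gap:+8.1%}")
```

On these numbers, FRONT_DESK and DIRTY show the largest gaps toward 1-star reviews, while NYC and RECOMMEND lean toward 5-star reviews. AM is common in both groups, so its high raw count says less about what separates them.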

## Report

Finish this notebook by writing a brief report to the hotel managers describing what you've found in the reviews of their hotel, along with some actionable advice. Use whatever data, charts, word clouds, etc. that you think will help you make your case.
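
For the report, it may help to quote a few reviews that illustrate a topic. The sketch below shows one way to pull them out. It uses a small made-up list in place of the `hotel` DataFrame, and both the function name `reviews_with_topic` and the placeholder rows are hypothetical.

```python
# Made-up stand-in for a few rows of the `hotel` DataFrame (not real reviews).
rows = [
    {"overall": 1, "topics": ["FRONT_DESK", "DIRTY", "AM"], "text": "Placeholder review A"},
    {"overall": 5, "topics": ["NYC", "RECOMMEND", "AM"], "text": "Placeholder review B"},
    {"overall": 1, "topics": ["DIRTY", "SHOWER", "CHECK"], "text": "Placeholder review C"},
]


def reviews_with_topic(rows, topic, rating=None):
    """Return texts of reviews whose top topics include `topic`, optionally for one rating."""
    return [
        r["text"]
        for r in rows
        if topic in r["topics"] and (rating is None or r["overall"] == rating)
    ]


print(reviews_with_topic(rows, "DIRTY", rating=1))
```

Against the real DataFrame, a similar filter could be written as something like `hotel[(hotel["overall"] == 1) & hotel["topics"].apply(lambda ts: "DIRTY" in ts)]["text"]`.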

In [None]:
# The same analysis can be applied to other hotels' reviews