## happyHotel Data Challenge

### Overall goal:
The overall objective of this data challenge is to identify topics in written hotel reviews in order to provide recommendations for specific areas of improvement to different hotel branches.

### Data:
The data consist of free-form text reviews from visitors to ten different hotels. Each review also comes with a corresponding rating of 'happy' or 'not happy' as reported by the reviewer.

### Specific objectives:
    1) Design and execute a method to identify topics within the reviews

    2) Assign each hotel a score for each topic

    3) Make specific recommendations to the general managers of each hotel

#### Additional objectives for engineering challenge:

    4) How would you design this system to update over time?

    5) How would you persist topics from one timestep/update to another?

    6) How would you design your scores so that hotel managers can meaningfully understand when they’re doing better?
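
One possible direction for objectives 4 and 5, as a minimal sketch rather than a worked-out design: scikit-learn's `LatentDirichletAllocation` has a `partial_fit` method for online updates, and a fitted model can be serialized with `pickle`. The tiny vocabulary and the two review batches below are made up for illustration; fixing the vocabulary up front is an assumption I make so that topic-word columns stay comparable between updates.

```python
import pickle

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# a fixed vocabulary keeps the columns of the topic-word matrix stable across updates
vocabulary = ["room", "clean", "dirty", "staff", "friendly", "breakfast", "location"]
vectorizer = CountVectorizer(vocabulary=vocabulary)

# hypothetical review batches arriving at two different times
first_batch = ["room clean staff friendly", "dirty room dirty", "breakfast location staff"]
later_batch = ["friendly staff breakfast", "room dirty location"]

lda = LatentDirichletAllocation(n_components=2, learning_method="online", random_state=1)
lda.partial_fit(vectorizer.transform(first_batch))

# persist the fitted topics (in practice this would be written to disk or a model store)
saved = pickle.dumps(lda)

# later: restore the same topics and update them with the new reviews
restored = pickle.loads(saved)
restored.partial_fit(vectorizer.transform(later_batch))
doc_topics = restored.transform(vectorizer.transform(later_batch))
```

Because the restored model starts from the saved topic-word matrix, topic 0 before the update is still topic 0 after it, which is what makes scores comparable from one timestep to the next.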

### Initial thoughts and plan

This is a topic modelling challenge. I will start with a general exploration of the data and data cleanliness. I will then preprocess and vectorize the text in preparation for modelling. I will use LDA to extract topics from the reviews and score each hotel by its average topic weight.

In [2]:
import pandas as pd
import nltk, re, pprint
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [3]:
df_happy = pd.read_csv('hotel_happy_reviews.csv')
df_sad = pd.read_csv('hotel_not_happy_reviews.csv')

In [3]:
# concatenate the two dataframes
df = pd.concat([df_happy, df_sad]).sort_values(by='User_ID').reset_index(drop=True)
df

Unnamed: 0,User_ID,Description,Is_Response,hotel_ID
0,id10326,The room was kind of clean but had a VERY stro...,not happy,3
1,id10327,I stayed at the Crown Plaza April -- - April -...,not happy,9
2,id10328,I booked this hotel through Hotwire at the low...,not happy,3
3,id10329,Stayed here with husband and sons on the way t...,happy,8
4,id10330,My girlfriends and I stayed here to celebrate ...,not happy,3
...,...,...,...,...
38927,id49253,We arrived late at night and walked in to a ch...,happy,8
38928,id49254,The only positive impression is location and p...,not happy,2
38929,id49255,Traveling with friends for shopping and a show...,not happy,5
38930,id49256,The experience was just ok. We paid extra for ...,not happy,4


In [4]:
# check for nulls
df.isnull().sum()

User_ID        0
Description    0
Is_Response    0
hotel_ID       0
dtype: int64

In [5]:
# check for duplicates
print(df.duplicated().sum())
print(df.User_ID.duplicated().sum())

0
0


In [6]:
# make sure response column has only two values
df.Is_Response.value_counts()

happy        26521
not happy    12411
Name: Is_Response, dtype: int64

In [7]:
# inspect hotel column
df.hotel_ID.value_counts()

4     6847
5     6682
8     5353
7     5317
3     5082
1     3929
2     2058
10    1511
6     1157
9      996
Name: hotel_ID, dtype: int64

The data look clean, but the number of reviews per hotel is imbalanced (996 for hotel 9 versus 6847 for hotel 4). I will explore this further and correct it if necessary.

In [8]:
# look at number and frequency of hotels in each df

df_freq = pd.DataFrame()

df_freq['happy_n'] = df_happy.hotel_ID.value_counts()
df_freq['sad_n'] = df_sad.hotel_ID.value_counts()
df_freq['sad_happy_ratio'] = df_freq['sad_n'] / df_freq['happy_n']
df_freq['happy_freq'] = df_happy.hotel_ID.value_counts() / len(df_happy)
# share of all not happy reviews that belong to each hotel
df_freq['sad_freq'] = df_sad.hotel_ID.value_counts() / len(df_sad)
df_freq = df_freq.sort_index()
df_freq

Unnamed: 0,happy_n,sad_n,sad_happy_ratio,happy_freq,sad_freq
1,2179,1750,0.803121,0.082161,0.141004
2,1046,1012,0.967495,0.03944,0.081541
3,3470,1612,0.464553,0.13084,0.129885
4,4651,2196,0.472157,0.17537,0.17694
5,5540,1142,0.206137,0.208891,0.092015
6,823,334,0.405832,0.031032,0.026912
7,3019,2298,0.761179,0.113834,0.185158
8,4503,850,0.188763,0.16979,0.068488
9,513,483,0.94152,0.019343,0.038917
10,777,734,0.944659,0.029298,0.059141


The data are imbalanced both across hotels and between happy and not happy reviews. I'm not sure how much imbalance affects topic modelling, but I expect it matters because topic models are driven by word frequencies across documents, so the hotels with the most reviews would dominate the topics. Some NLP methods are also slow on large corpora. To address both issues I will start by subsampling the dataset, which shrinks it and lets me draw an equal number of reviews from each hotel and response group.
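
A minimal sketch of how the per-group sample size could be derived from the data instead of being hard-coded. The toy frames and the `balanced_sample` helper are illustrative and not part of this notebook; it uses `DataFrameGroupBy.sample`, which needs pandas 1.1 or later, and the only assumption about the real data is the `hotel_ID` column.

```python
import pandas as pd

def balanced_sample(frames, group_col, random_state=1):
    """Draw the same number of rows from every group in every frame."""
    # the smallest group across all frames sets the sample size
    n = min(frame[group_col].value_counts().min() for frame in frames)
    sampled = [
        frame.groupby(group_col).sample(n=n, random_state=random_state)
        for frame in frames
    ]
    return pd.concat(sampled).reset_index(drop=True), n

# toy stand-ins for df_happy and df_sad
toy_happy = pd.DataFrame({"hotel_ID": [1] * 5 + [2] * 3, "Is_Response": "happy"})
toy_sad = pd.DataFrame({"hotel_ID": [1] * 2 + [2] * 4, "Is_Response": "not happy"})

balanced, n = balanced_sample([toy_happy, toy_sad], "hotel_ID")
group_sizes = balanced.groupby(["hotel_ID", "Is_Response"]).size()
```

With the real data the smallest group is the 334 not happy reviews for hotel 6, which is where the sample size in the next cell comes from.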

In [9]:
# subsample 334 reviews from each category and hotel
# (334 is the size of the smallest group: not happy reviews for hotel 6)
tmp_happy = df_happy.groupby('hotel_ID').apply(lambda x: x.sample(n=334, random_state=1))
tmp_sad = df_sad.groupby('hotel_ID').apply(lambda x: x.sample(n=334, random_state=1))

# concatenate the two dataframes
df = pd.concat([tmp_happy, tmp_sad]).sort_values(by='User_ID').reset_index(drop=True)
df

Unnamed: 0,User_ID,Description,Is_Response,hotel_ID
0,id10327,I stayed at the Crown Plaza April -- - April -...,not happy,9
1,id10335,"Wonderful staff, great location, but it was de...",not happy,3
2,id10338,We stay at the Jolly Madison over the Xmas per...,not happy,8
3,id10340,I found the hotel clean and nicely located. Go...,happy,3
4,id10346,"Having stayed at many Hilton properties, I exp...",not happy,10
...,...,...,...,...
6675,id49225,Bugs in the room. Told the front desk and they...,not happy,7
6676,id49227,Booked this through hotwire at $--- a night. T...,not happy,5
6677,id49228,"After having stayed at the Waldorf many times,...",happy,7
6678,id49245,I didn't get to see much of Seattle but the ho...,happy,8


In [10]:
# sanity check
df.groupby(['hotel_ID', 'Is_Response']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,User_ID,Description
hotel_ID,Is_Response,Unnamed: 2_level_1,Unnamed: 3_level_1
1,happy,334,334
1,not happy,334,334
2,happy,334,334
2,not happy,334,334
3,happy,334,334
3,not happy,334,334
4,happy,334,334
4,not happy,334,334
5,happy,334,334
5,not happy,334,334


### Clean and preprocess the text

In [16]:
# do some basic cleaning

# make separate df for text only
docs = df.Description.to_frame()

# make everything lower case
docs['Description'] = docs.Description.str.lower()

# remove non-alphabet characters
docs['Description'] = docs.Description.apply(lambda x: re.sub(r"[^a-z ]", r"", x))

docs.head()

Unnamed: 0,Description
0,i stayed at the crown plaza april april th...
1,wonderful staff great location but it was defi...
2,we stay at the jolly madison over the xmas per...
3,i found the hotel clean and nicely located goo...
4,having stayed at many hilton properties i expe...
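
One caveat with the pattern above: deleting non-letters outright glues words together wherever punctuation is not followed by a space, and turns contractions into tokens like `didnt` that a stop-word list may not contain (`havent` shows up in one of the topics later on). A small sketch of an alternative that replaces non-letters with a space instead; the `clean_text` helper and the example sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Lower-case, turn every run of non-letters into one space, trim the ends."""
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)  # replace, rather than delete, non-letters
    return text.strip()

raw = "Clean room.Good location,but we didn't sleep!"

# the pattern used in the cell above: non-letters are deleted
deleted = re.sub(r"[^a-z ]", "", raw.lower())

# the alternative: non-letters become word boundaries
spaced = clean_text(raw)

print(deleted)  # clean roomgood locationbut we didnt sleep
print(spaced)   # clean room good location but we didn t sleep
```

The second form splits contractions into fragments such as `didn` and `t`, so it works best when the stop-word list also covers those fragments.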


In [17]:
# tokenize
docs['tokens'] = docs.Description.apply(lambda x: word_tokenize(x))
docs.head()

Unnamed: 0,Description,tokens
0,i stayed at the crown plaza april april th...,"[i, stayed, at, the, crown, plaza, april, apri..."
1,wonderful staff great location but it was defi...,"[wonderful, staff, great, location, but, it, w..."
2,we stay at the jolly madison over the xmas per...,"[we, stay, at, the, jolly, madison, over, the,..."
3,i found the hotel clean and nicely located goo...,"[i, found, the, hotel, clean, and, nicely, loc..."
4,having stayed at many hilton properties i expe...,"[having, stayed, at, many, hilton, properties,..."


In [19]:
# create a set of stop words
stops = set(stopwords.words('english'))

# remove stop words
docs['cleaned'] = docs.tokens.apply(lambda x: [word for word in x if word not in stops])

docs.head()

Unnamed: 0,Description,tokens,cleaned
0,i stayed at the crown plaza april april th...,"[i, stayed, at, the, crown, plaza, april, apri...","[stayed, crown, plaza, april, april, staff, fr..."
1,wonderful staff great location but it was defi...,"[wonderful, staff, great, location, but, it, w...","[wonderful, staff, great, location, definately..."
2,we stay at the jolly madison over the xmas per...,"[we, stay, at, the, jolly, madison, over, the,...","[stay, jolly, madison, xmas, period, main, fea..."
3,i found the hotel clean and nicely located goo...,"[i, found, the, hotel, clean, and, nicely, loc...","[found, hotel, clean, nicely, located, good, f..."
4,having stayed at many hilton properties i expe...,"[having, stayed, at, many, hilton, properties,...","[stayed, many, hilton, properties, expect, fri..."


In [20]:
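# stem each token (I prefer lemmatizing but it will take too long)
# NOTE: this stems `tokens` (which still contains stop words) and overwrites that column;
# `cleaned`, the column passed to the vectorizer later, is left unstemmed

stemmer = PorterStemmer()

docs['tokens'] = docs.tokens.apply(lambda x: [stemmer.stem(word) for word in x])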

In [21]:
# rejoin words
docs['cleaned'] = docs.cleaned.apply(lambda x: ' '.join(x))

In [22]:
docs.head()

Unnamed: 0,Description,tokens,cleaned
0,i stayed at the crown plaza april april th...,"[i, stay, at, the, crown, plaza, april, april,...",stayed crown plaza april april staff friendly ...
1,wonderful staff great location but it was defi...,"[wonder, staff, great, locat, but, it, wa, def...",wonderful staff great location definately pric...
2,we stay at the jolly madison over the xmas per...,"[we, stay, at, the, jolli, madison, over, the,...",stay jolly madison xmas period main feature lo...
3,i found the hotel clean and nicely located goo...,"[i, found, the, hotel, clean, and, nice, locat...",found hotel clean nicely located good free shu...
4,having stayed at many hilton properties i expe...,"[have, stay, at, mani, hilton, properti, i, ex...",stayed many hilton properties expect friendly ...
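
As the table shows, the stems ended up in `tokens` while `cleaned` kept the full words, so the model downstream sees unstemmed text (which is why both `machine` and `machines` appear in the topics). A minimal sketch of stemming after stop-word removal so that the stems are what gets joined and vectorized; the five-word stop list and the token list are toy stand-ins, not the NLTK list used above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# tiny illustrative stop list (the notebook uses NLTK's English list instead)
stop_words = {"i", "at", "the", "and", "was"}

tokens = ["i", "stayed", "at", "the", "location", "and", "rooms"]

# 1) drop stop words, 2) stem what is left, 3) rejoin for the vectorizer
kept = [t for t in tokens if t not in stop_words]
stemmed = [stemmer.stem(t) for t in kept]
model_input = " ".join(stemmed)

print(model_input)  # stay locat room
```

Rerunning the notebook this way would change the vocabulary and therefore the topics, so the outputs below correspond to the unstemmed text.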


### Topic Modelling

I will start by arbitrarily setting the number of topics to 10 and then decrease or increase that number if needed.
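
One way to make the increase-or-decrease decision less of a guess is to compare held-out perplexity for a few candidate topic counts. This is only a sketch under assumptions: the toy corpus and the candidate values are made up, and lower perplexity does not always mean more interpretable topics, so I would treat it as one signal among several.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus standing in for the cleaned reviews
train_docs = [
    "room clean bed comfortable",
    "room dirty carpet stains",
    "staff friendly helpful desk",
    "breakfast coffee free good",
    "location walk square street",
    "staff rude desk slow",
    "bed clean sheets towels",
    "breakfast good coffee location",
]
heldout_docs = ["room clean staff friendly", "breakfast coffee location walk"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_heldout = vectorizer.transform(heldout_docs)  # reuse the training vocabulary

# fit one model per candidate topic count and score it on unseen documents
perplexity_by_k = {}
for k in (2, 3, 4):
    lda_k = LatentDirichletAllocation(n_components=k, random_state=1)
    lda_k.fit(X_train)
    perplexity_by_k[k] = lda_k.perplexity(X_heldout)

best_k = min(perplexity_by_k, key=perplexity_by_k.get)
```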

In [23]:
n_components = 10
n_top_words = 20

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

Vectorize the documents. From my reading it seems that LDA uses term frequencies (tf), without the additional inverse document frequencies (idf). I don't yet understand why that is but I'm going with it. *Note to self: think about this and the implications for balanced/unbalanced dataset.*
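
A partial answer to the question above: LDA is a generative model in which every word occurrence in a document is drawn from one of the topics, so its likelihood is defined over word counts. Tf-idf weights are real-valued rescalings of those counts and no longer have that interpretation. A small sketch of the difference between the two matrices, on three made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["room clean room", "staff friendly", "room dirty"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(toy_docs)     # integer term counts
weights = TfidfVectorizer().fit_transform(toy_docs)   # real-valued, L2-normalised rows

room_col = count_vectorizer.vocabulary_["room"]
room_count_doc0 = counts[0, room_col]  # "room" occurs twice in the first document

print(counts.dtype, weights.dtype)
```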

In [29]:
# vectorize
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=2000) # capped at 2000 for speed
tf = tf_vectorizer.fit_transform(docs.cleaned)

In [30]:
# fit the LDA model
lda = LatentDirichletAllocation(n_components=n_components,
                                max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=1)
lda.fit(tf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='online', learning_offset=50.0,
                          max_doc_update_iter=100, max_iter=5,
                          mean_change_tol=0.001, n_components=10, n_jobs=None,
                          perp_tol=0.1, random_state=1, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [42]:
# get top 20 terms in each topic
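# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
tf_feature_names = tf_vectorizer.get_feature_names_out()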
print_top_words(lda, tf_feature_names, n_top_words)

Topic #0: ice machine iron machines board hill apart ironing vending capitol october hairdryer works missing havent report rooftop incident fell cafe
Topic #1: hotel room staff great us stay service would rooms time stayed friendly one helpful nice location well desk back bar
Topic #2: room breakfast hotel nice good one bed coffee bathroom free small also night area water great stayed pool two floor
Topic #3: room dirty old carpet bathroom sheets cleaned hotel motel bed towels stains looked clean rooms night like filthy stayed first
Topic #4: hotel location great good staff clean stay rooms room nice stayed walk would friendly square comfortable area breakfast street walking
Topic #5: inn hilton property holiday tower sheraton points harbor pool hampton bay starwood marina inner baltimore member renovation parking area facility
Topic #6: walls thin westin hear hair alarm clock dryer paper hallway frequently neighbors camera traveller heavenly connect wood radio london black
Topic #7: d

Some of the topics are dominated by hotel brand and place names (for example Topic #5: hilton, sheraton, baltimore) rather than by aspects of the stay. If I had more time I would use part-of-speech tagging to remove proper nouns.
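
Part-of-speech tagging is not shown here, and it may be less reliable on this text because everything was lower-cased during cleaning. A lighter-weight alternative, swapped in only as a sketch, is to pass a hand-built list of brand and place words to the vectorizer as extra stop words. The list below is copied from the topics above and the three documents are made up; a real list would need to be curated and merged with the English stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hand-picked from the top words of the topics above (illustrative, not exhaustive)
brand_and_place_words = ["hilton", "sheraton", "westin", "baltimore"]

toy_docs = [
    "hilton baltimore pool parking",
    "westin walls thin",
    "sheraton pool renovation",
]

vectorizer = CountVectorizer(stop_words=brand_and_place_words)
X = vectorizer.fit_transform(toy_docs)

# the brand and place words never make it into the vocabulary
vocab = set(vectorizer.vocabulary_)
```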

Get topic scores for each review

In [37]:
# get topic scores for each review
topic_scores = pd.DataFrame(lda.transform(tf))

# add happy/sad labels
topic_scores['Is_Response'] = df['Is_Response']

# add hotel IDs
topic_scores['hotel_ID'] = df['hotel_ID']

topic_scores

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,Is_Response,hotel_ID
0,0.056937,0.189243,0.511272,0.098135,0.001124,0.001124,0.001124,0.116630,0.023288,0.001124,not happy,9
1,0.003226,0.003227,0.195238,0.003227,0.607260,0.003226,0.003226,0.003226,0.003226,0.174919,not happy,3
2,0.010026,0.321411,0.000481,0.000481,0.431602,0.000481,0.000481,0.000481,0.000481,0.234076,not happy,8
3,0.004000,0.004001,0.004002,0.004001,0.563821,0.004001,0.232893,0.175281,0.004000,0.004001,happy,3
4,0.000746,0.231122,0.086089,0.000746,0.000746,0.045691,0.000746,0.166562,0.000746,0.466804,not happy,10
...,...,...,...,...,...,...,...,...,...,...,...,...
6675,0.004000,0.004001,0.004001,0.230485,0.004000,0.004000,0.004000,0.004001,0.004000,0.737512,not happy,7
6676,0.001961,0.001961,0.001961,0.001961,0.001961,0.001961,0.001961,0.001961,0.001961,0.982350,not happy,5
6677,0.001818,0.422221,0.563231,0.001819,0.001819,0.001819,0.001819,0.001818,0.001818,0.001819,happy,7
6678,0.003226,0.328006,0.255261,0.003226,0.181639,0.098183,0.003226,0.120780,0.003226,0.003227,happy,8


Check overall 'goodness' and 'badness' of each topic

In [38]:
topic_scores.groupby(['Is_Response']).mean()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,hotel_ID
Is_Response,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
happy,0.004391,0.320245,0.156434,0.011916,0.358133,0.014018,0.005968,0.037308,0.00435,0.087237,5.5
not happy,0.004027,0.116467,0.157116,0.054692,0.176541,0.014829,0.008282,0.064885,0.00366,0.399501,5.5


It looks like topics 1 and 4 have a generally positive sentiment, topics 3 and 9 have a generally negative sentiment, and the rest of the topics are roughly neutral. If I had more time I would decrease the number of topics because there seems to be some redundancy. I would also look up methods that give a more quantitative measure of topic coherence to help me choose a better number of topics.
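
The reading above can be made explicit by subtracting the not happy mean from the happy mean for each topic. A quick sketch using the values from the table above, typed in by hand; the name `polarity` is my own label for the difference, not an established metric.

```python
import pandas as pd

topics = list(range(10))

# mean topic weights copied from the groupby output above
happy = pd.Series(
    [0.004391, 0.320245, 0.156434, 0.011916, 0.358133,
     0.014018, 0.005968, 0.037308, 0.00435, 0.087237],
    index=topics,
)
not_happy = pd.Series(
    [0.004027, 0.116467, 0.157116, 0.054692, 0.176541,
     0.014829, 0.008282, 0.064885, 0.00366, 0.399501],
    index=topics,
)

# positive = more weight in happy reviews, negative = more weight in not happy reviews
polarity = (happy - not_happy).sort_values()
print(polarity)
```

Sorted this way, topics 9 and 3 sit at the negative end and topics 4 and 1 at the positive end, with topic 7 leaning slightly negative.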

Look at topic averages for each hotel

In [41]:
# Is_Response is a string column, so average only the numeric topic columns
topic_scores.groupby(['hotel_ID']).mean(numeric_only=True)

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9
hotel_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,0.004335,0.208542,0.144235,0.029255,0.300205,0.012859,0.006942,0.053821,0.003967,0.235838
2,0.004037,0.224546,0.158883,0.035727,0.269491,0.016731,0.006984,0.053258,0.003976,0.226367
3,0.004,0.22128,0.157046,0.038532,0.274336,0.017775,0.007374,0.046558,0.004058,0.229041
4,0.003651,0.213369,0.159951,0.034412,0.268187,0.013413,0.007374,0.048417,0.003708,0.247518
5,0.004407,0.212802,0.167259,0.029254,0.266407,0.013913,0.006648,0.047443,0.003855,0.248012
6,0.00487,0.228169,0.153165,0.033485,0.263873,0.01371,0.006777,0.048207,0.004003,0.243741
7,0.004142,0.22442,0.165522,0.033218,0.240739,0.014512,0.008195,0.053894,0.004095,0.251265
8,0.003991,0.208672,0.150334,0.034931,0.268131,0.012117,0.006601,0.052998,0.004182,0.258043
9,0.004044,0.220639,0.145438,0.028198,0.269063,0.01497,0.007737,0.054907,0.003833,0.25117
10,0.004612,0.22112,0.165915,0.036027,0.252936,0.014238,0.006616,0.051464,0.004377,0.242695


This needs visualization (and possibly also clustering) to make the insights more apparent.

In the meantime here are some quick comments:

- Nice job to Hotel 6 on scoring highest in staff friendliness and to Hotel 1 on scoring highest in cleanliness
- Hotel 3, you need to up your game on cleanliness - seriously, it's gross
- Hotels 7, 8, and 9, it looks like your front desk staff might need some extra training
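
The raw averages differ only in the second or third decimal place, so before plotting it may help to standardize each topic across hotels. A minimal sketch for topic 3 (the "dirty room" topic), with the values copied by hand from the per-hotel table above; using z-scores here is my assumption about a reasonable scale, not something the challenge prescribes.

```python
import pandas as pd

# mean weight of topic 3 per hotel, copied from the table above
dirty_topic = pd.Series(
    [0.029255, 0.035727, 0.038532, 0.034412, 0.029254,
     0.033485, 0.033218, 0.034931, 0.028198, 0.036027],
    index=range(1, 11),
    name="topic_3",
)

# z-score: how many standard deviations each hotel sits from the across-hotel mean
z = (dirty_topic - dirty_topic.mean()) / dirty_topic.std()

worst_hotel = z.idxmax()  # most weight on the negative topic
best_hotel = z.idxmin()   # least weight on the negative topic
print(z.sort_values(ascending=False))
```

The same transformation applied per update would also give managers a relative position that can be tracked over time, which touches on objective 6.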