# Review analysis to find potential improvements for customer satisfaction

The general idea is to analyse hotel reviews, find the topics each review is about, and then analyse the negative reviews to find out what can be improved.

Later, the results will be used to train a network that replaces the statistical part (LDA) with a different approach.

We have to be cautious with the overall results because the dataset contains TripAdvisor reviews for different hotels, resorts, and hostels. Hence, the data is not really homogeneous, which will add topics that are specific to the different kinds of accommodation.

## Structure

1. [Prerequisite](#1.0)
2. [Data preprocessing](#2.0)
3. [Choosing model parameters](#3.0)
    * [Tuning hyperparameters](#3.1)
4. [Keyword Extraction](#4.0)

<a id='1.0'></a>
## 1.0 Prerequisite

In [1]:
import pandas as pd
import numpy as np
import contractions
from gensim.models import CoherenceModel, LdaModel
import gensim
import spacy
import pathlib
#from scipy.stats import skew, kurtosis, mode

In [2]:
# Data Visualisation
import pyLDAvis.gensim
import plotly.graph_objects as go
import plotly.figure_factory as ff

%matplotlib inline
pyLDAvis.enable_notebook()

In [3]:
from preprocessing import text_preprocessing, get_word_frequency, get_rake_phrases, rake_preprocessing
from n_grams import context_processing
from modelprocessing import get_doc_topic_matrix, get_topic_word_matrix, merge_doc_word_matrix
from modelprocessing import get_importance_normalisation, get_sentiment_normalisation_model, get_sentiment_normalisation_rake_words

In [4]:
# Path to save the data
path = pathlib.Path("C:/Users/Simon/Desktop/Projects/Topic-Analysis/Data/")

In [6]:
%load_ext autoreload
%autoreload 2

In [7]:
#Load the data set
data_all = pd.read_csv("Data/tripadvisor_hotel_reviews.csv")

<a id='2.0'></a>
## 2.0 Data Preprocessing

Preprocessing the text is done after loading the data.

In [8]:
STOP_WORDS = spacy.lang.en.stop_words.STOP_WORDS

First we need to check the stop words to see whether we want to remove or add certain words.

In our case we will remove no, not, without, very, and again from the stop word list so that they stay in the text. They might be used when a room does not have a certain feature or when the guest would come again.

Other words will be added to the list: st, nd, rd, and th, which are left over from 1st, 2nd, and so on. Additionally, time markers such as am and pm will be removed from the text.

With contractions we can expand contracted forms and fix some of the spelling mistakes to improve the results. In general, it would be possible to run a spell check before processing to improve the results further. For simplicity, this is skipped here.

In [9]:
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [10]:
# Keep negations and intensifiers in the text by removing them from the stop word list
to_remove = ['no', 'not', 'without', 'very', 'again']
for value in to_remove:
    STOP_WORDS.remove(value)

In [11]:
# Add ordinal suffixes, time markers, and filler words to the stop word list
to_add = ['st', 'nd', 'rd', 'th', 'pm', 'pmam', 'ampm', 'oh', 'yeah', 'yea', 'lol', 'ok', 'opt', 'dr', 'etc', 'com', 'usd', 'euro', 'con']
STOP_WORDS.update(to_add)

In [12]:
STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'ampm',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'com',
 'con',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'dr',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'etc',
 'euro',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'former

In [13]:
contractions.add("wo n't", 'will not')
contractions.add("can n't", 'can not')
contractions.add("didn", 'did not')
contractions.add("wasn", 'was not')
contractions.add("don", 'do not')
contractions.add("n't", 'not')
contractions.add("nt", 'not')
contractions.add("p m ", 'pm')
contractions.add("a m ", 'am')

After preparing the words we can now process the reviews with the `text_preprocessing` function, which performs the following steps:

1. Convert to lower case
2. Check for missing whitespace after punctuation and insert it
3. Remove HTML text
4. Remove accented characters
5. Fix contractions
6. Remove numbers
7. Remove punctuation
8. Remove single characters
9. Remove extra whitespace
10. Remove entities (countries, cities)
11. Lemmatize tokens
12. Remove stop words
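
The source of `text_preprocessing` is not shown in this notebook. As a rough illustration, steps 1 to 9 and 12 can be sketched with the standard library only. The function name `simple_text_preprocessing` and the small `CONTRACTIONS` table are assumptions made for this example; entity removal and lemmatization (steps 10 and 11) need spaCy and are left out.

```python
import re
import string
import unicodedata

# Hypothetical, tiny contraction table; the notebook uses the contractions package instead.
CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not"}

def simple_text_preprocessing(text, stopwords):
    """Simplified stand-in for steps 1-9 and 12 (no entity removal, no lemmatization)."""
    text = text.lower()                                    # 1. lower case
    text = re.sub(r"([.,!?;:])(?=\S)", r"\1 ", text)       # 2. whitespace after punctuation
    text = re.sub(r"<[^>]+>", " ", text)                   # 3. strip HTML tags
    text = unicodedata.normalize("NFKD", text)             # 4. split accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")  #    and drop the accents
    for short, long in CONTRACTIONS.items():               # 5. expand contractions
        text = text.replace(short, long)
    text = re.sub(r"\d+", " ", text)                       # 6. remove numbers
    punctuation_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(punctuation_to_space)            # 7. remove punctuation
    text = re.sub(r"\b\w\b", " ", text)                    # 8. remove single characters
    tokens = text.split()                                  # 9. collapse extra whitespace
    return [t for t in tokens if t not in stopwords]       # 12. remove stop words

example = simple_text_preprocessing(
    "The room wasn't clean.Breakfast cost 12 USD at the café<br/>", {"the", "at"}
)
print(example)
```

Because "not" was removed from the stop word list earlier, the negation in "wasn't clean" survives as the token `not`.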

In [14]:
data_all['clean_text'] = data_all['Review'].apply(text_preprocessing, stopwords = STOP_WORDS)

In [14]:
data_all['clean_text'][1]

['ok',
 'special',
 'charge',
 'diamond',
 'member',
 'decide',
 'chain',
 'shoot',
 'anniversary',
 'start',
 'book',
 'suite',
 'pay',
 'extra',
 'website',
 'description',
 'suite',
 'bedroom',
 'bathroom',
 'standard',
 'hotel',
 'room',
 'took',
 'print',
 'reservation',
 'desk',
 'thing',
 'like',
 'tv',
 'couch',
 'ect',
 'desk',
 'clerk',
 'tell',
 'oh',
 'mixed',
 'suite',
 'description',
 'website',
 'sorry',
 'free',
 'breakfast',
 'got',
 'kid',
 'embassy',
 'suit',
 'sit',
 'room',
 'bathroom',
 'bedroom',
 'unlike',
 'suite',
 'day',
 'stay',
 'offer',
 'correct',
 'false',
 'advertising',
 'send',
 'prefer',
 'guest',
 'website',
 'email',
 'ask',
 'failure',
 'provide',
 'suite',
 'advertise',
 'website',
 'reservation',
 'description',
 'furnish',
 'hard',
 'copy',
 'reservation',
 'printout',
 'website',
 'desk',
 'manager',
 'duty',
 'reply',
 'solution',
 'send',
 'email',
 'trip',
 'guest',
 'survey',
 'follow',
 'email',
 'mail',
 'guess',
 'tell',
 'concern',
 'g

In the next step we get the word frequencies to see which words are the most frequent or might be uninformative for our topics. We might want to add these words to our stop word list; otherwise they will be removed later in the processing.

As expected, words like hotel, room, and stay are the top words. Later we will remove words that occur in more than 50% of the documents, so these words will most probably be dropped.

In [15]:
word_frequency = get_word_frequency(data_all['clean_text'])

In [16]:
word_frequency.most_common(50)

[('hotel', 51658),
 ('room', 47301),
 ('stay', 28196),
 ('good', 21925),
 ('great', 20594),
 ('staff', 16714),
 ('nice', 12981),
 ('time', 12090),
 ('location', 11355),
 ('day', 11020),
 ('clean', 10815),
 ('service', 10810),
 ('restaurant', 10167),
 ('breakfast', 9981),
 ('place', 9740),
 ('beach', 9474),
 ('food', 9441),
 ('like', 9321),
 ('walk', 9250),
 ('resort', 8728),
 ('night', 8563),
 ('pool', 8431),
 ('bed', 7710),
 ('small', 7128),
 ('area', 7082),
 ('friendly', 6921),
 ('people', 6849),
 ('want', 6517),
 ('bar', 6464),
 ('little', 6223),
 ('excellent', 6147),
 ('book', 6013),
 ('bathroom', 5950),
 ('recommend', 5944),
 ('view', 5916),
 ('look', 5743),
 ('helpful', 5694),
 ('price', 5556),
 ('trip', 5484),
 ('floor', 5297),
 ('use', 5228),
 ('need', 5164),
 ('water', 5153),
 ('lot', 5117),
 ('check', 4919),
 ('come', 4837),
 ('beautiful', 4713),
 ('thing', 4697),
 ('review', 4696),
 ('eat', 4686)]
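
The counts above come from the notebook's own helper, whose source is not shown. A minimal sketch of the same idea, assuming each review is a list of tokens, counts both the total occurrences and the share of documents a word appears in. The function name `word_and_document_frequency` and the toy documents are made up for this example.

```python
from collections import Counter

def word_and_document_frequency(token_lists):
    """Return total word counts and the share of documents each word occurs in."""
    token_lists = list(token_lists)
    word_counts = Counter()
    doc_counts = Counter()
    for tokens in token_lists:
        word_counts.update(tokens)       # every occurrence counts
        doc_counts.update(set(tokens))   # each document counts once per word
    n_docs = len(token_lists)
    doc_share = {word: count / n_docs for word, count in doc_counts.items()}
    return word_counts, doc_share

# Toy documents, not taken from the dataset
docs = [
    ["hotel", "room", "clean"],
    ["hotel", "room", "room", "small"],
    ["hotel", "beach"],
    ["pool", "beach"],
]
word_counts, doc_share = word_and_document_frequency(docs)
too_frequent = sorted(w for w, share in doc_share.items() if share > 0.5)
print(word_counts.most_common(3))
print(too_frequent)
```

The document share is the quantity that matters for the 50% cut-off mentioned above, since a word can be frequent overall while still being concentrated in a few long reviews.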

The last step is to save the data.

In [28]:
data_all.to_json(path / 'data_processed.json')

In [5]:
data_all = pd.read_json(path / 'data_processed.json')

<a id='3.0'></a>
## 3.0 Choosing model parameters

In the first step we estimate a single model to get a feeling for the data and to see whether the result matches our expectations. After this evaluation we either have to adjust the preprocessing or can move on.

The `context_processing` function performs the following steps:

1. Train a bigram model
2. Add the bigrams to the text corpus
3. Remove words which occur in more than 50% of the documents
4. Remove words which occur only 20 times or less
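
The `n_grams` module is not shown here, so the following is only a sketch of steps 3 and 4 in plain Python. The function name `filter_vocabulary` and its parameters are assumptions for this example; in gensim, `Dictionary.filter_extremes(no_below=..., no_above=...)` offers a comparable filter, with the difference that `no_below` counts documents rather than total occurrences.

```python
from collections import Counter

def filter_vocabulary(token_lists, max_doc_share=0.5, min_count=20):
    """Drop words that are in more than max_doc_share of the documents
    or that occur min_count times or less in the whole corpus."""
    token_lists = list(token_lists)
    n_docs = len(token_lists)
    total = Counter(t for tokens in token_lists for t in tokens)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    keep = {
        t for t in total
        if doc_freq[t] / n_docs <= max_doc_share and total[t] > min_count
    }
    return [[t for t in tokens if t in keep] for tokens in token_lists]

# Toy corpus with a lowered min_count so the effect is visible
docs = [["hotel", "pool", "pool"], ["hotel", "beach"], ["hotel", "pool"], ["hotel", "spa"]]
filtered = filter_vocabulary(docs, max_doc_share=0.5, min_count=1)
print(filtered)
```

In the toy corpus "hotel" is dropped because it is in every document, while "beach" and "spa" are dropped because they occur only once.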

In [6]:
corpus, dictionary = context_processing(data_all['clean_text'])

In [7]:
# Set training parameters.
num_topics = 7
chunksize = 2000
passes = 40
iterations = 500
eval_every = 1

# Build the index-to-word mapping.
temp = dictionary[0]  # accessing an item forces gensim to populate id2token
id2word = dictionary.id2token

lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                            num_topics= num_topics,
                                                            iterations =iterations,
                                                            id2word=id2word,
                                                            workers=3,
                                                            chunksize=chunksize,
                                                            passes=passes,
                                                            eval_every=eval_every,
                                                            random_state = 12345
                                                           )

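Below we print the different topics. Note that we only get the words which are bundled in one topic but not a concrete topic name. Naming the topics needs to be done by humans.

As a first benchmark we have a model with 7 topics and an average topic coherence of -1.1149 in the run shown below.

The coherence score measures how well the top words of a topic fit together, based on how often they occur in the same documents. This notebook treats a more negative score as better; note that `u_mass` coherence is more commonly read the other way round, with values closer to zero indicating more coherent topics, so the rankings below should be interpreted with care.
With pyLDAvis we can analyse the topics even further. By decreasing lambda we can increase the importance of words which are unique to the topic.
pyLDAvis projects the topics onto two dimensions; by default it uses principal coordinate analysis (PCoA) for this.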
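
The effect of lambda can be illustrated with the relevance measure that pyLDAvis uses to rank terms, relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)). The sketch below is a stand-alone illustration; the probabilities are invented and do not come from the model in this notebook.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word for a topic, as a weighted mix of probability and lift."""
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(p_word_given_topic / p_word)

# Hypothetical probabilities: "hotel" is common everywhere, "snorkel" is rare but topic-specific
p_topic = {"hotel": 0.05, "snorkel": 0.02}
p_corpus = {"hotel": 0.04, "snorkel": 0.001}

ranking = {}
for lam in (1.0, 0.3):
    ranking[lam] = sorted(
        p_topic, key=lambda w: relevance(p_topic[w], p_corpus[w], lam), reverse=True
    )
    print(lam, ranking[lam])
```

With lambda = 1 the words are ranked by their probability within the topic, so the generic word comes first. With a smaller lambda the ratio to the overall word probability dominates and the topic-specific word moves to the top.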

In [8]:
top_topics = lda_model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.1149.
[([(0.13508242, 'orchard_road'),
   (0.085648835, 'star_ferry'),
   (0.07289001, 'royal_club'),
   (0.067015104, 'langham_place'),
   (0.06151889, 'antiche_figure'),
   (0.060043048, 'gran_bahia'),
   (0.057734556, 'aqua_palm'),
   (0.05626854, 'grand_flamenco'),
   (0.05306144, 'coral_princess'),
   (0.0502636, 'harbour_plaza'),
   (0.047902174, 'tropical_princess'),
   (0.047695633, 'nusa_dua'),
   (0.03588286, 'bowling_alley'),
   (0.033453252, 'rice_field'),
   (0.030942375, 'king_cross'),
   (0.025658514, 'ferry_terminal'),
   (0.024183454, 'elite_club'),
   (0.021764612, 'residence_michelangiolo'),
   (0.0054315897, 'light_district'),
   (0.0018245601, 'sea_fishing')],
  -0.341872432269023),
 ([(0.30893368, 'royal_club'),
   (0.22636037, 'grand_flamenco'),
   (0.13071865, 'coral_princess'),
   (0.088141136, 'king_cross'),
   (0.06728149, 'antiche_figure'),
   (0.021349484, 'tropical_princess'),
   (0.019890035, 'light_district'),
   (0.015113066,

In [None]:
# Visualize the topics
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

We can already see and interpret some of the topics from pyLDAvis:

1. Topic 1 seems to be something like restaurant/beach/pool, with words such as water, beach, pool, restaurant, and bar
2. Topic 2 is centered around the room: bedroom, bed, shower
3. Topic 3 seems to be complaint handling, since we have words such as tell, desk, ask, service, and manager
4. Topic 4 might be something like the location relative to public transport
5. Topic 5 covers praise and recommendations
6. Topic 6 covers locations with nearby restaurants
7. Topic 7 seems to be about special offers like the spa. It could be something like amenities

Now we can improve our results further by setting different hyperparameters and tuning the algorithm.

Second, we can think about analysing only the negative reviews, since we are interested in their topics.

<a id='3.1'></a>
### 3.1 Tuning hyperparameters

Note: the calculation takes a while.

We estimate the LDA model with different hyperparameters: the number of topics, eta, and decay.
Coherence measures how closely related the words within a topic are. We use the `u_mass` coherence score, which is assumed here to lie in -16 < x < 16, with -16 treated as the best value.

Later we estimate the models for the negative reviews only.

The hypothesis:
Positive reviews are about a bigger variety of topics. Hence the coherence score should be bigger (less negative).
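
To make the coherence score more tangible, here is a simplified sketch of a UMass-style coherence for a single topic. It is not gensim's implementation: it uses a smoothing constant of 1 and averages over word pairs, and the function name `umass_coherence` and the toy documents are assumptions for this example. It also assumes that every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents, smoothing=1.0):
    """Average log((D(w_i, w_j) + smoothing) / D(w_j)) over ordered pairs of top words,
    where D counts the documents that contain all given words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            scores.append(math.log((doc_count(w_i, w_j) + smoothing) / doc_count(w_j)))
    return sum(scores) / len(scores)

# Toy documents, not taken from the dataset
docs = [["pool", "beach", "bar"], ["pool", "beach"], ["room", "bed"], ["room", "bed", "shower"]]
together = umass_coherence(["pool", "beach"], docs)   # words that share documents
apart = umass_coherence(["pool", "bed"], docs)        # words that never co-occur
print(together, apart)
```

In this sketch, word pairs that rarely appear in the same document push the score further below zero, while pairs that always appear together keep it close to zero.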

In [8]:
#data_all = pd.read_json(path / 'data_processed.json')

In [9]:
chunksize = 2000
passes = 40
iterations = 500
eval_every = 1

min_topics = 6
max_topics = 10
step_size = 1
topic_range = range(min_topics, max_topics, step_size)

eta_list = list(np.arange(0.01, 1, 0.2))
eta_list.append('auto')
decay_list = list(np.arange(0.5, 1, 0.1))


model_results = {'Number Topics': [],
                 'Eta': [],
                 'Decay':[],
                 'Coherence': []
                }

#corpus, dictionary = context_processing(data_all['clean_text'])
temp = dictionary[0]
id2word = dictionary.id2token
  
for num_topics in topic_range:
    for eta in eta_list:
        for decay in decay_list: 
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                                num_topics= num_topics,
                                                                iterations =iterations,
                                                                id2word=id2word,
                                                                workers=3,
                                                                chunksize=chunksize,
                                                                passes=passes,
                                                                eval_every=eval_every,
                                                                eta=eta,
                                                                decay=decay,
                                                                random_state = 12345
                                                                           )
            score = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
            cv = score.get_coherence()

            # Save the model results
            model_results['Number Topics'].append(num_topics)
            model_results['Eta'].append(eta)
            model_results['Decay'].append(decay)
            model_results['Coherence'].append(cv)

In [10]:
finetune_models = pd.DataFrame(model_results)
finetune_models.to_json(path / 'finetune_models.json')

In [11]:
finetune_models.nsmallest(10, 'Coherence')

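|    | Number Topics | Eta | Decay | Coherence |
|----|---------------|------|-------|-----------|
| 30 | 7 | 0.01 | 0.5 | -1.337464 |
| 31 | 7 | 0.01 | 0.6 | -1.331845 |
| 32 | 7 | 0.01 | 0.7 | -1.330638 |
| 90 | 9 | 0.01 | 0.5 | -1.322202 |
| 33 | 7 | 0.01 | 0.8 | -1.314028 |
| 1  | 6 | 0.01 | 0.6 | -1.30576 |
| 34 | 7 | 0.01 | 0.9 | -1.299256 |
| 61 | 8 | 0.01 | 0.6 | -1.294855 |
| 60 | 8 | 0.01 | 0.5 | -1.286154 |
| 2  | 6 | 0.01 | 0.7 | -1.280064 |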
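
The three nested loops of the grid search can also be written with `itertools.product`. The sketch below only shows the bookkeeping; `dummy_score` is a placeholder that stands in for training the LDA model and computing its coherence, and `run_grid` is a name chosen for this example.

```python
from itertools import product

def run_grid(topic_range, eta_list, decay_list, score_fn):
    """Evaluate score_fn for every parameter combination and collect the results."""
    results = {"Number Topics": [], "Eta": [], "Decay": [], "Coherence": []}
    for num_topics, eta, decay in product(topic_range, eta_list, decay_list):
        results["Number Topics"].append(num_topics)
        results["Eta"].append(eta)
        results["Decay"].append(decay)
        results["Coherence"].append(score_fn(num_topics, eta, decay))
    return results

def dummy_score(num_topics, eta, decay):
    # Placeholder only: a real run would train the model and return its coherence.
    return -float(num_topics) - decay

topic_range = range(6, 10)                          # 6, 7, 8, 9
eta_list = [0.01, 0.21, 0.41, 0.61, 0.81, "auto"]   # same values as in the notebook
decay_list = [0.5, 0.6, 0.7, 0.8, 0.9]

results = run_grid(topic_range, eta_list, decay_list, dummy_score)
print(len(results["Coherence"]))
```

The grid contains 4 x 6 x 5 = 120 combinations, and the iteration order matches the nested loops, so row 30 is again the model with 7 topics, eta 0.01, and decay 0.5.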


In [12]:
# Set training parameters.
num_topics = 7
chunksize = 2000
passes = 40
iterations = 500
eval_every = 1
eta =  0.01
decay = 0.5

# Build the index-to-word mapping.
temp = dictionary[0]  # accessing an item forces gensim to populate id2token
id2word = dictionary.id2token

lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                                num_topics= num_topics,
                                                                iterations =iterations,
                                                                id2word=id2word,
                                                                workers=3,
                                                                chunksize=chunksize,
                                                                passes=passes,
                                                                eval_every=eval_every,
                                                                eta=eta,
                                                                decay=decay,
                                                                random_state = 12345
                                                   )

In [13]:
top_topics = lda_model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -1.3372.
[([(0.21845077, 'royal_club'),
   (0.16543534, 'grand_flamenco'),
   (0.13441582, 'coral_princess'),
   (0.11528558, 'gran_bahia'),
   (0.09592477, 'tropical_princess'),
   (0.081842944, 'king_cross'),
   (0.06982568, 'bowling_alley'),
   (0.052414298, 'ferry_terminal'),
   (0.015332918, 'light_district'),
   (0.00095137896, 'clean'),
   (0.0009321126, 'nice'),
   (0.00083309895, 'location'),
   (0.0007368107, 'friendly'),
   (0.0006784373, 'service'),
   (0.0006494721, 'excellent'),
   (0.0006456144, 'breakfast'),
   (0.00057768694, 'place'),
   (0.00054293044, 'recommend'),
   (0.0005113482, 'helpful'),
   (0.00046336, 'night')],
  -0.9425637417373972),
 ([(0.020742035, 'beach'),
   (0.020535933, 'resort'),
   (0.016044233, 'food'),
   (0.01278446, 'pool'),
   (0.011476758, 'time'),
   (0.011152692, 'day'),
   (0.010288663, 'people'),
   (0.009364807, 'restaurant'),
   (0.009163896, 'water'),
   (0.008976758, 'like'),
   (0.008930987, 'drink'),
   (0

In [36]:
# Visualize the topics
pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)

The results above show that the coherence scores are very close across the different topic numbers and parameters: the ten lowest scores lie between roughly -1.28 and -1.34.
pyLDAvis shows the distribution of words across the different topics.
In the following we compare these values to the LDA model trained on the negative reviews only.

In [6]:
sub_set = data_all[data_all['Rating']<3].reset_index(drop=True).copy()
corpus, dictionary = context_processing(sub_set['clean_text'])

In [10]:
sub_set

|      | Review | Rating | clean_text |
|------|--------|--------|------------|
| 0 | ok nothing special charge diamond member hilto... | 2 | [special, charge, diamond, member, decide, cha... |
| 1 | poor value stayed monaco seattle july, nice ho... | 2 | [poor, value, stay, nice, hotel, price, night,... |
| 2 | horrible customer service hotel stay february ... | 1 | [horrible, customer, service, hotel, stay, fri... |
| 3 | disappointed say anticipating stay hotel monac... | 2 | [disappointed, anticipate, stay, hotel, base, ... |
| 4 | great location need internally upgrade advanta... | 2 | [great, location, need, internally, upgrade, a... |
| ... | ... | ... | ... |
| 3209 | deceptive staff deceptive desk staff claiming ... | 2 | [deceptive, staff, deceptive, desk, staff, cla... |
| 3210 | not impressed unfriendly staff checked asked h... | 2 | [not, impressed, unfriendly, staff, check, ask... |
| 3211 | ok just looks nice modern outside, desk staff ... | 2 | [look, nice, modern, outside, desk, staff, not... |
| 3212 | hotel theft ruined vacation hotel opened sept ... | 1 | [hotel, theft, ruin, vacation, hotel, open, ha... |


In [10]:
chunksize = 2000
passes = 40
iterations = 500
eval_every = 1

min_topics = 6
max_topics = 10
step_size = 1
topic_range = range(min_topics, max_topics, step_size)

eta_list = list(np.arange(0.01, 1, 0.2))
eta_list.append('auto')
decay_list = list(np.arange(0.5, 1, 0.1))


model_results = {'Number Topics': [],
                 'Eta': [],
                 'Decay':[],
                 'Coherence': []
                }

sub_set = data_all[data_all['Rating']<3].reset_index(drop=True).copy()
corpus, dictionary = context_processing(sub_set['clean_text'])
temp = dictionary[0]
id2word = dictionary.id2token
  
for num_topics in topic_range:
    for eta in eta_list:
        for decay in decay_list: 
            lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                                num_topics= num_topics,
                                                                iterations =iterations,
                                                                id2word=id2word,
                                                                workers=3,
                                                                chunksize=chunksize,
                                                                passes=passes,
                                                                eval_every=eval_every,
                                                                eta=eta,
                                                                decay=decay,
                                                                random_state = 12345
                                                                           )
            score = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
            cv = score.get_coherence()

            # Save the model results
            model_results['Number Topics'].append(num_topics)
            model_results['Eta'].append(eta)
            model_results['Decay'].append(decay)
            model_results['Coherence'].append(cv)

finetune_models_negative = pd.DataFrame(model_results)
finetune_models_negative.to_json(path / 'finetune_models_negative.json')

In [12]:
finetune_models_negative.nsmallest(10, 'Coherence')

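|    | Number Topics | Eta | Decay | Coherence |
|----|---------------|------|-------|-----------|
| 84 | 8 | 0.81 | 0.9 | -11.401455 |
| 2  | 6 | 0.01 | 0.7 | -11.236213 |
| 73 | 8 | 0.41 | 0.8 | -11.18662 |
| 79 | 8 | 0.61 | 0.9 | -11.115996 |
| 68 | 8 | 0.21 | 0.8 | -11.114029 |
| 74 | 8 | 0.41 | 0.9 | -11.03236 |
| 13 | 6 | 0.41 | 0.8 | -11.013905 |
| 3  | 6 | 0.01 | 0.8 | -11.001803 |
| 18 | 6 | 0.61 | 0.8 | -10.975702 |
| 62 | 8 | 0.01 | 0.7 | -10.974211 |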


In [22]:
# Set training parameters.
num_topics = 8
chunksize = 2000
passes = 40
iterations = 500
eval_every = 1
eta =  0.81
decay = 0.9

# Rebuild the corpus from the negative reviews and the index-to-word mapping.
sub_set = data_all[data_all['Rating']<3].reset_index(drop=True).copy()
corpus, dictionary = context_processing(sub_set['clean_text'])
temp = dictionary[0]
id2word = dictionary.id2token

lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                                num_topics= num_topics,
                                                                iterations =iterations,
                                                                id2word=id2word,
                                                                workers=3,
                                                                chunksize=chunksize,
                                                                passes=passes,
                                                                eval_every=eval_every,
                                                                eta=eta,
                                                                decay=decay,
                                                                random_state = 12345,
                                                                per_word_topics=True
                                                   )

In [23]:
top_topics = lda_model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -9.0731.
[([(0.04130328, 'feel_like'),
   (0.019790504, 'punta_cana'),
   (0.019386994, 'pool_area'),
   (0.018515073, 'smell_like'),
   (0.011234128, 'water_pressure'),
   (0.01113384, 'resort'),
   (0.011070723, 'food'),
   (0.0106987, 'dinner_reservation'),
   (0.009852723, 'mini_bar'),
   (0.009339132, 'la_carte'),
   (0.009264555, 'good'),
   (0.009030773, 'beach'),
   (0.008564609, 'main_buffet'),
   (0.008226766, 'day'),
   (0.007693588, 'time'),
   (0.007461953, 'service'),
   (0.0074480595, 'lunch_dinner'),
   (0.007443051, 'like'),
   (0.0072491644, 'travel_extensively'),
   (0.007071666, 'pool')],
  -2.547590060742536),
 ([(0.032671407, 'great_location'),
   (0.0227998, 'double_bed'),
   (0.015671859, 'block_away'),
   (0.013702159, 'no_idea'),
   (0.013514682, 'bed_comfortable'),
   (0.011349351, 'big_disappointment'),
   (0.010644626, 'far_away'),
   (0.009615658, 'general_manager'),
   (0.009517563, 'bottle_water'),
   (0.008178718, 'staff'),
   (

In [24]:
lda_model.save(fname='model\\2_star\\lda_model')
dictionary.save('model\\2_star\\dictionary')

Compared to the model trained on all reviews, the scores are far more negative (an average topic coherence of -9.07 versus -1.34). Under the convention used in this notebook this means more coherent topics, that is, less overlap in their words.

In the next step we want to extract more meaningful topic names.
We will use Rapid Automatic Keyword Extraction (RAKE), compare its results to the LDA algorithm, and integrate both.
The result will be a topic name with sub-names that carry more meaning.
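
For readers unfamiliar with RAKE, the core idea can be sketched in a few lines: the text is split into candidate phrases at stop words and punctuation, each word is scored by its degree divided by its frequency, and a phrase scores the sum of its word scores. This is a simplified illustration, not the notebook's `get_rake_phrases` helper; the function name `rake_scores`, the stop word set, and the sentence are made up.

```python
import re
from collections import defaultdict

def rake_scores(text, stopwords):
    """Score candidate phrases with the RAKE degree/frequency word score."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):   # punctuation ends a phrase
        current = []
        for word in fragment.split():
            if word in stopwords:                           # stop words end a phrase too
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)   # co-occurrences within the phrase, including itself

    word_score = {w: degree[w] / frequency[w] for w in frequency}
    return {" ".join(p): sum(word_score[w] for w in p) for p in phrases}

scores = rake_scores(
    "The air conditioning was broken, and the staff was rude.", {"the", "was", "and"}
)
print(scores)
```

Longer phrases made of words that mostly appear together receive higher scores, which is why multi-word terms such as "air conditioning" rank above single words.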

<a id='4.0'></a>
## 4.0 Keyword extraction

We will only consider reviews with a rating of 1 or 2 and use our pre-trained model with the following configuration:
1. num_topics = 8
2. eta = 0.81
3. decay = 0.9

With the model we can extract the topic distribution per review as well as the dominant topic. Afterwards we build the term-per-topic matrix, which we also merge into the data to get a first topic name.

In [71]:
data_all = pd.read_json(path / 'data_processed.json')

In [72]:
sub_set = data_all[data_all['Rating']<3].reset_index(drop=True).copy()
corpus, dictionary = context_processing(sub_set['clean_text'])

In [73]:
# Load pretrained model
lda_model = LdaModel.load('model\\2_star\\lda_model')
dictionary =  gensim.utils.SaveLoad.load('model\\2_star\\dictionary')

In [74]:
doc_topic_matrix = get_doc_topic_matrix(sub_set['clean_text'], lda_model, dictionary)
doc_topic_matrix

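|      | Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | Topic6 | Topic7 | Topic8 | dominant_topic | second_topic |
|------|--------|--------|--------|--------|--------|--------|--------|--------|----------------|--------------|
| 0 | 0.58 | 0.13 | 0.0 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 1 | 4 |
| 1 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.98 | 0.00 | 0.00 | 6 | 1 |
| 2 | 0.15 | 0.00 | 0.0 | 0.00 | 0.12 | 0.00 | 0.72 | 0.00 | 7 | 1 |
| 3 | 0.51 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.30 | 0.19 | 1 | 7 |
| 4 | 0.38 | 0.00 | 0.3 | 0.00 | 0.00 | 0.00 | 0.31 | 0.00 | 1 | 7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3209 | 0.81 | 0.18 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1 | 2 |
| 3210 | 0.71 | 0.00 | 0.0 | 0.00 | 0.28 | 0.00 | 0.00 | 0.00 | 1 | 5 |
| 3211 | 0.00 | 0.00 | 0.0 | 0.00 | 0.46 | 0.00 | 0.53 | 0.00 | 7 | 5 |
| 3212 | 0.81 | 0.00 | 0.0 | 0.19 | 0.00 | 0.00 | 0.00 | 0.00 | 1 | 4 |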
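
How `get_doc_topic_matrix` derives the `dominant_topic` and `second_topic` columns is not shown. One plausible reading, sketched here as an assumption, is to sort the topic probabilities of a review in descending order and take the two highest, with ties resolved in favour of the lower topic number. The function name `top_two_topics` is made up for this example.

```python
def top_two_topics(topic_probabilities):
    """Return the 1-based numbers of the most and second most probable topic."""
    order = sorted(
        range(len(topic_probabilities)),
        key=lambda i: topic_probabilities[i],
        reverse=True,   # sorted() is stable, so ties keep the lower index first
    )
    return order[0] + 1, order[1] + 1

# First three rows of the table above
rows = [
    [0.58, 0.13, 0.0, 0.29, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.0, 0.00, 0.00, 0.98, 0.00, 0.00],
    [0.15, 0.00, 0.0, 0.00, 0.12, 0.00, 0.72, 0.00],
]
top_two = [top_two_topics(row) for row in rows]
print(top_two)
```

For these rows the sketch gives the same topic numbers as the table, including the case where every topic except the dominant one has a probability of zero.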


In [75]:
sub_set = pd.merge(sub_set, doc_topic_matrix, how='inner', left_index=True, right_index=True)

In [76]:
topic_word_matrix = get_topic_word_matrix(lda_model,relevant_word=10)
topic_word_matrix

| Topic | Term1 | Term2 | Term3 | Term4 | Term5 | Term6 | Term7 | Term8 | Term9 | Term10 |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--------|
| Topic1 | great_location | double_bed | block_away | no_idea | bed_comfortable | big_disappointment | far_away | general_manager | bottle_water | staff |
| Topic2 | hot_water | travel_agent | pay_extra | waste_money | spend_money | save_money | wall_paper | year_old | central_station | business_center |
| Topic3 | look_like | staff_friendly | staff_member | let_know | bed_sheet | poor_quality | language_barrier | single_bed | value_money | fully_book |
| Topic4 | credit_card | desk_clerk | look_forward | horrible_experience | thank_god | web_site | waste_time | bottled_water | sound_like | cigarette_smoke |
| Topic5 | good_thing | read_review | minute_walk | worth_money | open_door | big_deal | air_conditioner | open_window | air_condition | write_review |
| Topic6 | feel_like | punta_cana | pool_area | smell_like | water_pressure | resort | food | dinner_reservation | mini_bar | la_carte |
| Topic7 | customer_service | trip_advisor | breakfast_buffet | train_station | shower_curtain | need_update | parking_lot | bed_bug | king_bed | internet_access |
| Topic8 | air_conditioning | beach_beautiful | ocean_view | walking_distance | good_value | star_rating | good_luck | feel_safe | highly_recommend | holiday_inn |


In [77]:
sub_set = merge_doc_word_matrix(sub_set, topic_word_matrix)

In [84]:
# Term length defaults to 5
sub_set['rake_text'] = sub_set['Review'].apply(rake_preprocessing)
all_topics_terms = get_rake_phrases(sub_set, max_length=5)

In [85]:
all_topics_terms["rank"] = all_topics_terms.groupby("topic_number")["score"].rank(method="dense", ascending=False)

In [88]:
all_topics_terms.loc[all_topics_terms['rank'] < 5]

Unnamed: 0,score,topic_number,term,parent,rank
0,25.000000,1,2 block away convenient hotel <br> great locat...,,1.0
57,24.000000,1,none workers staff asked injured,2 block away convenient hotel <br> great locat...,4.0
56,24.000000,1,desk staff needs training politeness,2 block away convenient hotel <br> great locat...,4.0
55,24.333333,1,hotel staff pleasant tried accomidate,2 block away convenient hotel <br> great locat...,3.0
54,24.500000,1,nobody access room staff hotel,2 block away convenient hotel <br> great locat...,2.0
...,...,...,...,...,...
1481,23.500000,8,hotel great security feel safe,air conditioning mosquitos rampant night <br> ...,4.0
1429,23.500000,8,fact rooms air conditioning clearly,air conditioning mosquitos rampant night <br> ...,4.0
1428,24.000000,8,air conditioning room died addressed,air conditioning mosquitos rampant night <br> ...,3.0
1427,25.000000,8,air conditioning mosquitos rampant night <br> ...,,1.0
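
The `rank` column above is a dense rank within each topic: the highest score gets rank 1, equal scores share a rank, and no rank numbers are skipped. The plain-Python sketch below shows the same logic without pandas; the function name `dense_rank_by_group` and the sample rows are assumptions for this example.

```python
def dense_rank_by_group(rows, group_key, score_key):
    """Dense rank (1 = highest score) of every row within its group."""
    distinct_scores = {}
    for row in rows:
        distinct_scores.setdefault(row[group_key], set()).add(row[score_key])

    # Map each distinct score of a group to its position in descending order
    position = {
        group: {score: rank for rank, score in enumerate(sorted(scores, reverse=True), start=1)}
        for group, scores in distinct_scores.items()
    }
    return [position[row[group_key]][row[score_key]] for row in rows]

# Hypothetical rows in the shape of all_topics_terms
rows = [
    {"topic_number": 1, "score": 25.0},
    {"topic_number": 1, "score": 24.0},
    {"topic_number": 1, "score": 24.0},
    {"topic_number": 1, "score": 24.5},
    {"topic_number": 8, "score": 23.5},
    {"topic_number": 8, "score": 25.0},
]
ranks = dense_rank_by_group(rows, "topic_number", "score")
print(ranks)
```

Because tied scores share a rank, filtering on `rank < 5` can return more than four terms per topic.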


In [89]:
df = all_topics_terms.loc[all_topics_terms['rank'] < 5]
fig = go.Figure(go.Sunburst(
    labels = df['term'].values.tolist(),
    parents = df['parent'].values.tolist(),
    values = df['score'].values.tolist()
))

fig.update_layout(margin = dict(t = 0, l = 0,  r = 0, b = 0))
fig.show()