# Review analysis to find potential improvements for customer satisfaction

The general idea is to analyse hotel reviews, assign each review to its corresponding topics, and then examine the negative reviews to find out what can be improved. 

Later, the results will be used to train a network that replaces the statistical part (LDA) with a different approach. 

We have to be cautious with the overall results, because the dataset contains TripAdvisor reviews for different hotels, resorts, and hostels. Hence, the data is not really homogeneous, which will introduce topics that are specific to different kinds of accommodation. 

## Structure

1. [Prerequisite](#1.0)
2. [Data preprocessing](#2.0)
3. [Choosing model parameters](#3.0)
    * [Evaluating number of topics](#3.1)
    * [Detail comparison](#3.2)
4. [Visualizations](#4.0)

<a id='1.0'></a>
## 1.0 Prerequisite

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Data processing
import LDA_preprocessing
import LDA_contextprocessing 
import LDA_modelprocessing 
import sentiment_functions 
from sklearn.model_selection import train_test_split
from gensim.models import CoherenceModel
import gensim

# Data Visualisation
import pyLDAvis.gensim
import plotly.graph_objects as go
import plotly.figure_factory as ff

%matplotlib inline
pyLDAvis.enable_notebook()

<a id='2.0'></a>
## 2.0 Data Preprocessing

The first step is loading the data. Three additional stopwords are included in the preprocessing. Afterwards, the result is displayed.

The following steps are done in the text_normalization function:

1. Strip HTML
2. Remove accented characters
3. Expand contractions
4. Convert all characters to lowercase
5. Expand cases of "word.word" to "word. word"
6. Remove special characters and digits
7. Remove single and double characters
8. Lemmatization
9. Remove entities
10. Remove stopwords
11. Remove duplicates
12. Remove extra whitespaces
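
A few of the steps above can be sketched with the standard library alone. This is only an illustration under assumptions: `normalize_sketch` and its tiny `STOPWORDS` set are hypothetical names, not the contents of `LDA_preprocessing.text_normalization`, and HTML stripping, contraction expansion, lemmatization, entity removal and duplicate removal are left out.

```python
import re
import unicodedata

# Hypothetical, tiny stopword list for illustration only
STOPWORDS = {"the", "a", "was", "and", "pm", "am", "ft"}

def normalize_sketch(text, stopwords=STOPWORDS):
    # Step 2: decompose accented characters and drop the non-ASCII marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Step 4: convert to lowercase
    text = text.lower()
    # Step 5: expand "word.word" to "word. word"
    text = re.sub(r"(?<=[a-z])\.(?=[a-z])", ". ", text)
    # Step 6: replace everything that is not a letter or whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # Steps 7 and 10: drop tokens with one or two characters and stopwords
    tokens = [t for t in text.split() if len(t) > 2 and t not in stopwords]
    # Step 12: joining on single spaces removes extra whitespace
    return " ".join(tokens)

print(normalize_sketch("Concerned guest.The staff was café-style, 11 pm!"))
```

The order of the steps matters: lowercasing before the regular expressions keeps the patterns simple, and stopwords are removed last so that they match the cleaned tokens.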

In [2]:
#Load the data set
data_all = pd.read_csv("Data/tripadvisor_hotel_reviews.csv")



In [69]:
new_stopwords = ['pm', 'am', 'ft']

In [70]:
# Add the new stopwords, then remove duplicates and sort the list
LDA_preprocessing.stopword_list.extend(new_stopwords)
LDA_preprocessing.stopword_list = sorted(set(LDA_preprocessing.stopword_list))

In [71]:
data_all['clean_text'] = data_all['Review'].apply(LDA_preprocessing.text_normalization)

In [72]:
data_all['Review'][1]

'ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite paid extra website description not, suite bedroom bathroom standard hotel room, took printed reservation desk showed said things like tv couch ect desk clerk told oh mixed suites description kimpton website sorry free breakfast, got kidding, embassy suits sitting room bathroom bedroom unlike kimpton calls suite, 5 day stay offer correct false advertising, send kimpton preferred guest website email asking failure provide suite advertised website reservation description furnished hard copy reservation printout website desk manager duty did not reply solution, send email trip guest survey did not follow email mail, guess tell concerned guest.the staff ranged indifferent not helpful, asked desk good breakfast spots neighborhood hood told no hotels, gee best breakfast spots seattle 1/2 block away convenient hotel does not know exist, arrived late night 11 pm inside run bellman bu

In [73]:
data_all['clean_text'][1]

'nothing special charge diamond member decide chain shoot anniversary seattle start book suite pay extra website description suite bedroom bathroom standard take print reservation desk show say thing like couch ect desk clerk tell mixed suite description website sorry free breakfast get kid embassy suit sit bathroom bedroom unlike call suite offer correct false advertising send preferred guest website email ask failure provide suite advertise website reservation description furnish hard copy reservation printout website desk manager duty reply solution send email trip guest survey follow email mail guess tell concerned guest range indifferent helpful ask desk breakfast spot neighborhood hood tell gee breakfast spot block away convenient know exist arrive inside run bellman busy chat cell phone help bag prior arrival email inform anniversary half really picky want make sure get email say like deliver bottle champagne chocolate cover strawberry arrival celebrate tell need foam pillow arr

Next, we add the review length and the TextBlob sentiment as features.
TextBlob takes negations into account, so a phrase such as "not good" is scored as negative. Later, we can compare this approach with the sentiment given by the user.
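
One way such a comparison could look is a simple agreement rate between the two labels. The sketch below uses the rating and TextBlob polarity of the first five reviews shown in this notebook and borrows the thresholds applied later on (ratings of 4 and 5 count as positive, polarity above 0.2 counts as positive). The helper names `human_label` and `blob_label` are hypothetical.

```python
# (rating, TextBlob polarity) pairs of the first five reviews in data_all
rows = [(4, 0.208744), (2, 0.214923), (3, 0.29442), (5, 0.504825), (5, 0.384615)]

def human_label(rating):
    # Ratings of 4 and 5 stars are treated as positive, everything else as negative
    return "positive" if rating >= 4 else "negative"

def blob_label(polarity, threshold=0.2):
    # Polarity above the threshold is treated as positive
    return "positive" if polarity > threshold else "negative"

matches = [human_label(r) == blob_label(p) for r, p in rows]
agreement = sum(matches) / len(matches)
print(f"agreement on {len(rows)} reviews: {agreement:.2f}")
```

On these five rows the two labels agree for three reviews. Both disagreements are reviews with a low rating and a polarity slightly above 0.2.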

In [74]:
# calculate number of tokens for each review
data_all['ntokens'] = data_all['clean_text'].str.split().str.len()

# Get Sentiment with textblob
data_all['Sent_Blob'] = data_all['Review'].apply(lambda x: sentiment_functions.get_sent(x))

In [12]:
data_all.head()

Unnamed: 0,Review,Rating,clean_text,ntokens,Sent_Blob
0,nice hotel expensive parking got good deal sta...,4,nice hotel expensive parking get good deal sta...,75,0.208744
1,ok nothing special charge diamond member hilto...,2,nothing special charge diamond member decide c...,214,0.214923
2,nice rooms not 4* experience hotel monaco seat...,3,nice room experience hotel level positive larg...,178,0.29442
3,"unique, great stay, wonderful time hotel monac...",5,unique great stay wonderful time hotel locatio...,77,0.504825
4,"great stay great stay, went seahawk game aweso...",5,great stay great stay go seahawk game awesome ...,157,0.384615


In [13]:
data_all['clean_text'].count()

20491

In [14]:
data_all.query("ntokens>=10")
data_all.nsmallest(20, 'ntokens')

Unnamed: 0,Review,Rating,clean_text,ntokens,Sent_Blob
15950,"loved hotel great hotel excellent location, ni...",5,enjoy moment,2,0.72
1501,"loved resort amazing space lot, not bored,",5,love resort amazing space lot bored,6,0.516667
7867,"shiznit spend 300/night not regret, miami note...",5,regret note minute drive south beach,6,0.0
9486,fantastic hotel staff rooms food high standard...,5,high standard handy travel close station,6,0.29
18073,"nice hotel view second room room left row,",5,nice hotel view room leave row,6,0.2
488,"issues n't say 4 star service great pool bar,",3,issue say star service great pool bar,7,0.8
3293,great hotel spent wonderful nights san juan ma...,5,great hotel spend food good definitely stay,7,0.566667
4524,"feeling cheated, westin signature comfy bed wi...",4,feel signature comfy bed willing pay comfort,7,0.25
4708,best hotel stayed days probably best hotel sta...,5,good hotel stay day probably good hotel,7,0.5875
4719,"great city hilton hilton simply wonderful, loc...",5,simply wonderful location fantastic hotel clea...,7,0.641667


In the next step, we compute word frequencies to see which words are the most frequent and might be uninformative for our topics. We may want to add these words to our stopword list.
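
For a quick check without spaCy, the same kind of count can be sketched with `collections.Counter`. The three short texts are hypothetical stand-ins for `data_all['clean_text']`, and splitting on whitespace is an assumption that may differ slightly from spaCy's tokenization.

```python
from collections import Counter

# Hypothetical cleaned reviews standing in for data_all['clean_text']
clean_texts = ["nice hotel room", "hotel room small", "great hotel stay"]

# Count every whitespace-separated token across all reviews
word_counts = Counter(token for text in clean_texts for token in text.split())
print(word_counts.most_common(2))
```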

In [15]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [16]:
doc1 = [nlp(doc) for doc in data_all['clean_text']]

In [17]:
def wordFrequ(doc_r):
    # Count how often each token text occurs across all spaCy documents
    word_frequencies = {}
    for doc in doc_r:
        for word in doc:
            word_frequencies[word.text] = word_frequencies.get(word.text, 0) + 1
    return word_frequencies

In [18]:
word_freq_vec = wordFrequ(doc1)

In [19]:
word_freq_vec = pd.DataFrame([word_freq_vec]).T

In [20]:
word_freq_vec.nlargest(40, 0)

Unnamed: 0,0
hotel,50461
room,46399
stay,27778
good,21847
great,20333
staff,16544
nice,13001
time,11680
location,11113
day,10914


In [21]:
word_freq_vec[0].describe()

count    40646.000000
mean        44.154972
std        520.990922
min          1.000000
25%          1.000000
50%          1.000000
75%          4.000000
max      50461.000000
Name: 0, dtype: float64

Several words stand out. In particular, "hotel" and "room" are very common, which is to be expected for hotel reviews. Other frequent words are also candidates for our stopword list: "stay", "good", "great", "staff", "nice", "location", "time", "day", "clean", "service" and "resort". These words probably appear in most reviews and therefore add noise for the LDA algorithm. We therefore redo the necessary preprocessing steps with these words included as stopwords. Finally, we save the prepared data.

In [4]:
new_stopwords = ['hotel', 'resort', 'room', 'stay', 'good', 'great', 'staff', 'nice', 'time', 'location', 'day',  'clean', 'service','pm', 'am', 'ft', 'wo']

In [5]:
# Add the new stopwords, then remove duplicates and sort the list
LDA_preprocessing.stopword_list.extend(new_stopwords)
LDA_preprocessing.stopword_list = sorted(set(LDA_preprocessing.stopword_list))

In [6]:
data_all['clean_text'] = data_all['Review'].apply(LDA_preprocessing.text_normalization)
data_all['ntokens'] = data_all['clean_text'].str.split().str.len()

In [65]:
data_all['Review'][1]

'ok nothing special charge diamond member hilton decided chain shot 20th anniversary seattle, start booked suite paid extra website description not, suite bedroom bathroom standard hotel room, took printed reservation desk showed said things like tv couch ect desk clerk told oh mixed suites description kimpton website sorry free breakfast, got kidding, embassy suits sitting room bathroom bedroom unlike kimpton calls suite, 5 day stay offer correct false advertising, send kimpton preferred guest website email asking failure provide suite advertised website reservation description furnished hard copy reservation printout website desk manager duty did not reply solution, send email trip guest survey did not follow email mail, guess tell concerned guest.the staff ranged indifferent not helpful, asked desk good breakfast spots neighborhood hood told no hotels, gee best breakfast spots seattle 1/2 block away convenient hotel does not know exist, arrived late night 11 pm inside run bellman bu

In [66]:
data_all['clean_text'][1]

'nothing special charge diamond member decide chain shoot anniversary seattle start book suite pay extra website description suite bedroom bathroom standard take print reservation desk show say thing like couch ect desk clerk tell mixed suite description website sorry free breakfast get kid embassy suit sit bathroom bedroom unlike call suite offer correct false advertising send preferred guest website email ask failure provide suite advertise website reservation description furnish hard copy reservation printout website desk manager duty reply solution send email trip guest survey follow email mail guess tell concerned guest range indifferent helpful ask desk breakfast spot neighborhood hood tell gee breakfast spot block away convenient know exist arrive inside run bellman busy chat cell phone help bag prior arrival email inform anniversary half really picky want make sure get email say like deliver bottle champagne chocolate cover strawberry arrival celebrate tell need foam pillow arr

In [79]:
data_all.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')

<a id='3.0'></a>
## 3.0 Choosing model parameters

### Text has to be rewritten after change in the preprocessing


The dataset contains 20491 reviews, which we first analysed for their length. The reviews are mostly long, with a mean of 104.37 tokens (including punctuation) and a median of 77. Even the minimum is 7 tokens. This is a good foundation, because Latent Dirichlet Allocation (LDA) performs well on medium-sized or large texts. On short texts (< 50 words), short text topic models (STTM) such as the Gibbs Sampling Dirichlet Mixture Model (GSDMM) tend to perform better. In our case, roughly 27.29% of the reviews have 50 tokens or fewer. 
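
The length statistics quoted above can be reproduced along these lines. The token counts are hypothetical stand-ins for `data_all['ntokens']`, and plain Python is used here only to keep the sketch free of dependencies.

```python
import statistics

# Hypothetical token counts standing in for data_all['ntokens']
ntokens = [75, 214, 178, 77, 157, 6, 42]

mean_len = statistics.mean(ntokens)
median_len = statistics.median(ntokens)
# Share of reviews with 50 tokens or fewer, the range where short text models tend to do better
share_short = sum(n <= 50 for n in ntokens) / len(ntokens)

print(mean_len, median_len, round(share_short, 4))
```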

We will analyse bigrams and trigrams for two different data sets. The first contains all reviews, the second only the reviews with more than 20 tokens in the clean text (the `ntokens > 20` filter in the code below).

The first approach is to estimate the LDA model based on the whole data set. We split the dataset into a training and a test set, where the test set contains 5 reviews that remain unseen by the algorithm.

In [3]:
#data_all = pd.read_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')

<a id='3.1'></a>
### 3.1 Evaluating number of topics

The calculation takes a bit of time!!

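First, we test the coherence score for bigrams and trigrams to evaluate the best semantic input. Afterwards, we estimate the LDA model with different hyperparameters for the semantic inputs.
Coherence measures the relative distance between the words within a topic. We use the u_mass coherence score (`coherence='u_mass'` in the code below), which in this analysis is read on a scale of -16 < x < 16, with -16 treated as the best value. Note that u_mass is more commonly read the other way round, with values closer to 0 indicating more coherent topics, so this convention deserves a second check.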
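
To make the score less abstract, here is a minimal sketch of a UMass-style coherence for a single topic, assuming the common formulation that sums log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words, where D counts the documents containing the given words. The function `umass_coherence` is hypothetical, it averages over the pairs, and gensim's own implementation may differ in smoothing and aggregation details.

```python
import math

def umass_coherence(top_words, documents):
    """Average UMass-style score for one topic.

    top_words: topic words ordered from most to least probable.
    documents: list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all given words
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for i in range(1, len(top_words)):
        for j in range(i):
            denominator = doc_freq(top_words[j])
            if denominator == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(top_words[i], top_words[j])
            scores.append(math.log((joint + 1) / denominator))
    return sum(scores) / len(scores) if scores else float("nan")

# Hypothetical toy corpus
documents = [["pool", "beach"], ["pool", "beach"], ["pool", "bar"]]
print(umass_coherence(["pool", "beach"], documents))
print(umass_coherence(["pool", "desk"], documents))
```

In this sketch, word pairs that often appear in the same documents score near 0, while pairs that rarely co-occur score clearly negative.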

The results of the calculation are saved in coh_mat.json in the data subfolder, since the calculation is computationally intensive and takes time. Furthermore, the results are plotted in the graph below.
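
A small caching helper is one possible way to avoid re-running such an expensive step. The sketch below is an assumption about how this could be organised, not the notebook's own code: `load_or_compute`, `expensive` and the file name `coh_mat_demo.json` are hypothetical.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Return cached JSON results if `path` exists, otherwise compute and store them."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    result = compute()
    with open(path, "w") as fh:
        json.dump(result, fh)
    return result

calls = []

def expensive():
    # Stand-in for the coherence loop; records how often it really runs
    calls.append(1)
    return {"bi_gram_all": [-1.13, -1.22]}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "coh_mat_demo.json")
    first = load_or_compute(cache_file, expensive)   # computes and writes the file
    second = load_or_compute(cache_file, expensive)  # reads the cached file

print(len(calls), first == second)
```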

In [55]:
coh_mat = pd.DataFrame(index=range(0,20),columns=['bi_gram_all','tri_gram_all','bi_gram_20','tri_gram_20'], dtype='float')
mod = ['bi_gram','tri_gram','bi_gram','tri_gram']
# Minimum token count (exclusive) per column: all reviews first, then reviews with more than 20 tokens
min_tokens = [0, 0, 20, 20]
n = 0 
for i in mod:
    X_train, X_test = train_test_split(data_all[data_all['ntokens'] > min_tokens[n]]['clean_text'], test_size=5, 
                               random_state=1)
    results = LDA_contextprocessing.context_processing(X_train, model = i)
    corpus = results['text_corpus']
    id2word = results['id2word']

    for x in range(1, 20):
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                   id2word=id2word,
                                                   num_topics= x,
                                                   random_state=100,
                                                   update_every=1,
                                                   chunksize=100,
                                                   passes=10,
                                                   alpha='auto',
                                                   per_word_topics=True)
        cm = CoherenceModel(model=lda_model, corpus=corpus, coherence='u_mass')
        coh_mat.iloc[x,[n]] = cm.get_coherence()
    n = n+1

In [66]:
df = coh_mat.reset_index().rename(columns = {'index':'Topic number'})

In [57]:
df

Unnamed: 0,Topic number,bi_gram_all,tri_gram_all,bi_gram_40,tri_gram_40
0,0,,,,
1,1,-1.134731,-1.13473,-0.994408,-0.994387
2,2,-1.216467,-1.218616,-1.06214,-1.06522
3,3,-1.290645,-1.252608,-1.140553,-1.163662
4,4,-1.318755,-1.340847,-1.271562,-1.283863
5,5,-1.591316,-1.438737,-1.315353,-1.36601
6,6,-1.863888,-1.525422,-1.477842,-1.455264
7,7,-1.808689,-1.609981,-1.630414,-1.520947
8,8,-1.86987,-1.921752,-1.839198,-1.786615
9,9,-1.816392,-2.205497,-2.004256,-2.09726


In [67]:
df.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\coh_mat1.json')

In the following graph we can see how the u_mass coherence score changes for the two data sets.

As the number of topics increases, the coherence score keeps improving. But this is the curse of clustering algorithms, which also applies to topic models: as we increase the number of topics, we lose generalization, and each topic is split into further subtopics.


Therefore, it is important to form expectations beforehand about how many topics there can be. 
The following is a list of potential topics:

1. Location (Distance to Airport/Train Station/City center, area)
2. Property (The hotel itself, garden, pool area)
3. Restaurant/Bar (Food, drink)
4. Room (Bathroom, Bedroom, clean, facilities)
5. Service (How was the service?)
6. Complaints (How were customer complaints handled?)
7. Check-In (How was the reception service)
8. Parking Fee

In [58]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['Topic number'], y=df['bi_gram_all'], mode='lines', name ='Bi gram (all reviews)'))
fig.add_trace(go.Scatter(x=df['Topic number'], y=df['tri_gram_all'], mode='lines', name ='Tri gram (all reviews)'))
fig.add_trace(go.Scatter(x=df['Topic number'], y=df['bi_gram_20'], mode='lines', name='Bi gram (num of token > 20)'))
fig.add_trace(go.Scatter(x=df['Topic number'], y=df['tri_gram_20'], mode='lines', name='Tri gram (num of token > 20)'))
fig.update_layout(
    title = {
        'text' : 'Coherence Score for Bi gram and Tri gram'
        + '<br>' +  '<span style="font-size: 18px;">considering all reviews and only num of token > 20</span>',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()

In [34]:
# Topics range
min_topics = 6
max_topics = 8
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.1))
alpha.append('symmetric')
alpha.append('asymmetric')
alpha.append('auto')
# Beta parameter
beta = list(np.arange(0.01, 1, 0.1))
beta.append('symmetric')

model_results = {'Data': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

dat = ['bi_gram_all','tri_gram_all','bi_gram_20','tri_gram_20']
conte = ['bi_gram','tri_gram','bi_gram','tri_gram']
# Minimum token count (exclusive) for each entry in dat
min_tokens = [0, 0, 20, 20]
n = 0 
for i in conte:
    X_train = data_all[data_all['ntokens'] > min_tokens[n]]['clean_text']
    results = LDA_contextprocessing.context_processing(X_train, model = i)
    corpus = results['text_corpus']
    id2word = results['id2word']
 
    for k in topics_range:
    # iterate through alpha values
        for a in alpha:
        # iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                cv = LDA_modelprocessing.compute_coherence_values(corpus_c=results['text_corpus'], dictionary_c=results['id2word'], k=k, a=a, b=b)
                # Save the model results
                model_results['Data'].append(dat[n])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
    n = n+1
df1 = pd.DataFrame(model_results)
df1.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\est_coh_mat2.json')

In [42]:
df1

Unnamed: 0,Data,Topics,Alpha,Beta,Coherence
0,bi_gram_all,6,0.01,0.01,-1.637900
1,bi_gram_all,6,0.01,0.11,-1.746415
2,bi_gram_all,6,0.01,0.21,-1.578196
3,bi_gram_all,6,0.01,0.31,-1.540667
4,bi_gram_all,6,0.01,0.41,-1.515226
...,...,...,...,...,...
1139,tri_gram_20,7,auto,0.61,-2.785518
1140,tri_gram_20,7,auto,0.71,-3.363312
1141,tri_gram_20,7,auto,0.81,-3.912227
1142,tri_gram_20,7,auto,0.91,-3.936296
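
The size of this results table follows directly from the search grid. The sketch below rebuilds the grid with the standard library to count the combinations. It mirrors the values used in the cell above, but builds the numeric steps with `round` instead of `np.arange`, which is an assumption made to keep the example free of dependencies.

```python
from itertools import product

# 0.01, 0.11, ..., 0.91: the ten numeric steps used for alpha and beta
numeric = [round(0.01 + 0.1 * i, 2) for i in range(10)]
alphas = numeric + ["symmetric", "asymmetric", "auto"]
betas = numeric + ["symmetric"]
datasets = ["bi_gram_all", "tri_gram_all", "bi_gram_20", "tri_gram_20"]
topic_counts = range(6, 8)  # yields 6 and 7, the upper bound is exclusive

# One LDA model is trained per combination
grid = list(product(datasets, topic_counts, alphas, betas))
print(len(grid))
```

With 4 data sets, 2 topic counts, 13 alpha values and 11 beta values, the grid contains 1144 models, which explains the long run time.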


In [43]:
# For every topic count and data/model combination, keep the row with the lowest coherence
selected = []
for k in [6, 7]:
    for name in ['bi_gram_all', 'tri_gram_all', 'bi_gram_20', 'tri_gram_20']:
        df = df1[df1['Data'] == name]
        selected.append(df[(df['Topics'] == k) & (df['Coherence'] == df[df['Topics'] == k]['Coherence'].min())])
min_coherence = pd.concat(selected)
min_coherence['token_num'] = [0,0,20,20,0,0,20,20]
min_coherence['model'] =['bi_gram','tri_gram','bi_gram','tri_gram','bi_gram','tri_gram','bi_gram','tri_gram']
#min_coherence.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\min_coh.json')
min_coherence = min_coherence.sort_values(['token_num','Data'])
min_coherence

Unnamed: 0,Data,Topics,Alpha,Beta,Coherence,token_num,model
132,bi_gram_all,6,auto,0.01,-1.842468,0,bi_gram
284,bi_gram_all,7,auto,0.91,-4.622747,0,bi_gram
427,tri_gram_all,6,auto,0.91,-3.786338,0,tri_gram
561,tri_gram_all,7,auto,0.01,-1.998014,0,tri_gram
704,bi_gram_20,6,auto,0.01,-1.858298,20,bi_gram
845,bi_gram_20,7,asymmetric,0.91,-3.722914,20,bi_gram
862,tri_gram_20,6,0.01,0.41,-3.089074,20,tri_gram
1142,tri_gram_20,7,auto,0.91,-3.936296,20,tri_gram


<a id='3.2'></a>
### 3.2 Detail comparison

After finding the model parameters for each number of topics (6 or 7) and the two different data sets, we evaluate how clearly each model prefers the dominant topic over the second most probable one.
A better model should choose its dominant topic with a larger margin.
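
The margin between the dominant and the second topic can be sketched as follows. The helper `topic_margin` is hypothetical and is not the implementation of `LDA_modelprocessing.lda_get_prob`. It only assumes that the topic distribution of a review is available as a list of `(topic_id, probability)` pairs, which is the shape gensim returns from `get_document_topics`.

```python
def topic_margin(topic_probs):
    """Difference between the largest and the second-largest topic probability."""
    probs = sorted((p for _, p in topic_probs), reverse=True)
    if not probs:
        return 0.0
    # A review with a single reported topic has no competitor
    second = probs[1] if len(probs) > 1 else 0.0
    return probs[0] - second

# Hypothetical topic distribution of one review
print(topic_margin([(0, 0.1), (3, 0.6), (5, 0.3)]))
```

A margin close to 0 means that two topics are almost equally likely for a review, while a margin close to 1 means that one topic clearly dominates.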

In [44]:
n = 0
nam_res =['dif_1','dif_2','dif_3','dif_4','dif_5','dif_6','dif_7','dif_8']

for y in min_coherence['token_num']: 
    results = LDA_contextprocessing.context_processing(data_all[data_all['ntokens'] > y]['clean_text'], model = min_coherence['model'].iloc[n])
    corpus = results['text_corpus']
    id2word = results['id2word']
        
    lda_model = gensim.models.ldamodel.LdaModel(corpus=results['text_corpus'],
                                           id2word=results['id2word'],
                                           num_topics= min_coherence['Topics'].iloc[n],
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha= min_coherence['Alpha'].iloc[n],
                                           eta = min_coherence['Beta'].iloc[n],
                                           per_word_topics=True)
    
    res1 = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_prob(lda_model, x, topic_order = 0))
    res2 = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_prob(lda_model, x, topic_order = 1))
    # Margin between the dominant and the second most probable topic for each review
    data_all[nam_res[n]] = res1 - res2
    n = n+1

In [45]:
data_all.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')
data_all.head()

Unnamed: 0,Review,Rating,clean_text,ntokens,Sent_Blob,dif_1,dif_2,dif_3,dif_4,dif_5,dif_6,dif_7,dif_8
0,nice hotel expensive parking got good deal sta...,4,expensive parking get deal anniversary arrive ...,59,0.208744,0.058078,0.15168,0.216526,0.043085,0.13156,0.409024,0.075901,0.128529
1,ok nothing special charge diamond member hilto...,2,nothing special charge diamond member decide c...,188,0.214923,0.211492,0.46451,0.311758,0.005613,0.07063,0.317075,0.406749,0.231098
2,nice rooms not 4* experience hotel monaco seat...,3,experience level positive large bathroom suite...,156,0.29442,0.064353,0.276058,0.080369,0.143949,0.005505,0.025231,0.062787,0.014113
3,"unique, great stay, wonderful time hotel monac...",5,unique wonderful excellent short stroll main d...,62,0.504825,0.288621,0.048258,0.179876,0.000846,0.015813,0.10763,0.054168,0.147735
4,"great stay great stay, went seahawk game aweso...",5,go seahawk game awesome downfall view building...,140,0.384615,0.032695,0.254202,0.21394,0.189096,0.104292,0.142262,0.216948,0.047606


In [46]:
model_results = pd.concat([pd.DataFrame(min_coherence).reset_index(),
                          pd.DataFrame([data_all['dif_1'].describe(),data_all['dif_2'].describe(),data_all['dif_3'].describe(),data_all['dif_4'].describe(),data_all['dif_5'].describe(),data_all['dif_6'].describe(),data_all['dif_7'].describe(),data_all['dif_8'].describe()]).reset_index()
                          ], axis=1).drop(columns=['index'])
model_results = model_results.drop(columns =['token_num','model', 'count'])
model_results.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\min_coh.json')

In [2]:
model_results = pd.read_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\min_coh.json')
df = model_results

In [4]:
vals =[ df.Coherence, df['mean'], df['std'], df['min'], df['25%'], df['50%'], df['75%'], df['max']]
font_color = ['darkslategray'] + ['darkslategray'] +['darkslategray']+['darkslategray'] + [['rgb(192,0,0)' if v < -3 else 'darkslategray' for v in vals[0]]]+[['rgb(192,0,0)' if v > 0.3 else 'darkslategray' for v in vals[1]]]+['darkslategray']+['darkslategray']+['darkslategray']+[['rgb(192,0,0)' if v > 0.25 else 'darkslategray' for v in vals[5]]]+['darkslategray']+['darkslategray']

fig = go.Figure(data=[go.Table(
    header=dict(values=list(df.columns),
                line_color='white', fill_color='white',
                align='left'),
    cells=dict(values=[df.Data, df.Topics, df.Alpha, df.Beta, vals[0],vals[1],vals[2],vals[3],vals[4],vals[5],vals[6],vals[7]],
               line_color='white', fill_color='white',
               align='left',
               format = [None, None, None, ",.2f", ",.4f", ",.4f", ",.4f", ",.4f", ",.4f", ",.4f", ",.4f", ",.4f"],
               font=dict(color=font_color)
              ))
])
fig.update_layout(
    title = {
        'text' : 'Minimum Coherence for data/model combination'
        + '<br>' +  '<span style="font-size: 18px;">Evaluation of how likely they are to choose the dominant topic over the second</span>',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()

In [26]:
data_all = pd.read_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')

In [28]:
data_all.head()

Unnamed: 0,Review,Rating,clean_text,ntokens,Sent_Blob,dif_1,dif_2,dif_3,dif_4,dif_5,dif_6,dif_7,dif_8
0,nice hotel expensive parking got good deal sta...,4,expensive parking get deal anniversary arrive ...,59,0.208744,0.058078,0.15168,0.216526,0.043085,0.13156,0.409024,0.075901,0.128529
1,ok nothing special charge diamond member hilto...,2,nothing special charge diamond member decide c...,188,0.214923,0.211492,0.46451,0.311758,0.005613,0.07063,0.317075,0.406749,0.231098
2,nice rooms not 4* experience hotel monaco seat...,3,experience level positive large bathroom suite...,156,0.29442,0.064353,0.276058,0.080369,0.143949,0.005505,0.025231,0.062787,0.014113
3,"unique, great stay, wonderful time hotel monac...",5,unique wonderful excellent short stroll main d...,62,0.504825,0.288621,0.048258,0.179876,0.000846,0.015813,0.10763,0.054168,0.147735
4,"great stay great stay, went seahawk game aweso...",5,go seahawk game awesome downfall view building...,140,0.384615,0.032695,0.254202,0.21394,0.189096,0.104292,0.142262,0.216948,0.047606


In [234]:
results = LDA_contextprocessing.context_processing(data_all[data_all['ntokens'] > 20]['clean_text'], model = 'bi_gram')
lda_model = gensim.models.ldamodel.LdaModel(corpus=results['text_corpus'],
                                           id2word=results['id2word'],
                                           num_topics= 7,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha= 'auto',
                                           eta = 0.91,
                                           per_word_topics=True)


In [235]:
topics = []
for topic_id in range(lda_model.num_topics):
    topk = lda_model.show_topic(topic_id, 15)
    topk_words = [ w for w, _ in topk ]
    topics.append([topic_id, ' '.join([ w for w, _ in lda_model.show_topic(topic_id, 10)])])
    print('{}: {}'.format(topic_id, ' '.join(topk_words)))
topics = pd.DataFrame(topics, columns =['Topic_id', 'Topic_words'])
#Manual assignment of the topics


0: walk breakfast area price small free city place night restaurant close modern street helpful minute
1: noise shuttle construction hear traffic loud noisy disturb earplug light_sleeper siren sleep ear_plug waikiki pioneer_square
2: food go pool people like restaurant drink get bar really want water eat buffet place
3: bed shower bathroom floor door small wall like open smell water window sleep look light
4: beach pool ocean villa chair casino excursion tour island sand water swim beautiful wave ride
5: excellent lovely friendly fantastic wonderful return view feel perfect love experience helpful recommend enjoy beautiful
6: check tell ask say book desk arrive leave get go give guest pay call know


In [19]:
# Visualize the topics
pyLDAvis.gensim.prepare(lda_model, results['text_corpus'], results['id2word'])

In [236]:
topics = pd.DataFrame(topics, columns =['Topic_id', 'Topic_words'])
topics['Topic_name'] =['Location', 'Disturbance', 'Food / Drinks', 'Room', 'Amenities','Good experience','Bad service']
topics

Unnamed: 0,Topic_id,Topic_words,Topic_name
0,0,walk breakfast area price small free city plac...,Location
1,1,noise shuttle construction hear traffic loud n...,Disturbance
2,2,food go pool people like restaurant drink get ...,Food / Drinks
3,3,bed shower bathroom floor door small wall like...,Room
4,4,beach pool ocean villa chair casino excursion ...,Amenities
5,5,excellent lovely friendly fantastic wonderful ...,Good experience
6,6,check tell ask say book desk arrive leave get go,Bad service


In [266]:
data_all['top1'] = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_topics(lda_model, x, topics, topic_order = 0))
data_all['top1_prob'] = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_prob(lda_model, x, topic_order = 0))
data_all['top2'] = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_topics(lda_model, x, topics, topic_order = 1))
data_all['top2_prob'] = data_all['clean_text'].apply(lambda x: LDA_modelprocessing.lda_get_prob(lda_model, x, topic_order = 1))
data_all.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')

<a id='4.0'></a>
## 4.0 Visualizations

In [21]:
hist_data = [data_all['Sent_Blob']]
group_labels = ['TextBlob']

fig = ff.create_distplot(hist_data, group_labels, show_hist = False)
fig.add_vline(x = -0.6, line_width = 1, line_dash = 'dash', line_color ='grey')
fig.add_vline(x = -0.2, line_width = 1, line_dash = 'dash', line_color ='grey')
fig.add_vline(x =  0.2, line_width = 1, line_dash = 'dash', line_color ='grey')
fig.add_vline(x =  0.6, line_width = 1, line_dash = 'dash', line_color ='grey')
fig.add_annotation(x = -0.8, y = 3, text = '1 Star: x <= -0.6', showarrow = False)
fig.add_annotation(x = -0.4, y =3, text = '2 Star: -0.6 < x < -0.2', showarrow = False)
fig.add_annotation(x = 0, y =3, text = '3 Star: -0.2 <= x <= 0.2', showarrow = False)
fig.add_annotation(x = 0.4, y = 3, text = '4 Star: 0.2 < x < 0.6', showarrow = False)
fig.add_annotation(x = 0.8, y = 3, text = '5 Star: x >= 0.6', showarrow = False)
fig.update_layout(
    showlegend = False,
    title = {
        'text' : 'Distribution of TextBlob Rating',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()

In [31]:
# create a list of our conditions
conditions = [
    (data_all['Sent_Blob'] <= -0.6),
    (data_all['Sent_Blob'] > -0.6) & (data_all['Sent_Blob'] < -0.2),
    (data_all['Sent_Blob'] >= -0.2) & (data_all['Sent_Blob'] <= 0.2),
    (data_all['Sent_Blob'] >  0.2) & (data_all['Sent_Blob'] < 0.6),
    (data_all['Sent_Blob'] >= 0.6)
    ]

# create a list of the values we want to assign for each condition
values = [1, 2, 3, 4, 5]

# create a new column and use np.select to assign values to it using our lists as arguments
data_all['blob_rating'] = np.select(conditions, values)
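
The thresholds mix open and closed interval bounds, so it can be worth checking how values that fall exactly on a boundary are binned. The following is a minimal, self-contained sketch on made-up polarity scores (the `demo_` names and values are illustrative and not taken from the dataset); it applies the same `np.select` conditions as the cell above.

```python
import numpy as np

# Illustrative polarity scores (not from the dataset), including every boundary value
demo_scores = np.array([-1.0, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 1.0])

# Same conditions as for data_all['Sent_Blob']
demo_conditions = [
    demo_scores <= -0.6,
    (demo_scores > -0.6) & (demo_scores < -0.2),
    (demo_scores >= -0.2) & (demo_scores <= 0.2),
    (demo_scores > 0.2) & (demo_scores < 0.6),
    demo_scores >= 0.6,
]

# np.select returns the value of the first condition that is True for each element
demo_stars = np.select(demo_conditions, [1, 2, 3, 4, 5])
print(demo_stars.tolist())
```

Scores of exactly -0.6 and 0.6 land in the outer classes (1 and 5 stars), while -0.2 and 0.2 both count as 3 stars, so the middle class is closed on both sides.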

In [32]:
# create a list of our conditions
conditions = [
    (data_all['blob_rating'] == 1),
    (data_all['blob_rating'] == 2),
    (data_all['blob_rating'] == 3),
    (data_all['blob_rating'] == 4),
    (data_all['blob_rating'] == 5)
    ]

# create a list of the values we want to assign for each condition
values = ['negative','negative','negative','positive','positive']

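# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default is given because recent NumPy versions reject mixing string choices with the numeric default 0)
data_all['Sentiment_Blob'] = np.select(conditions, values, default = 'unknown')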

In [37]:
# create a list of our conditions
conditions = [
    (data_all['Rating'] == 1),
    (data_all['Rating'] == 2),
    (data_all['Rating'] == 3),
    (data_all['Rating'] == 4),
    (data_all['Rating'] == 5)
    ]

# create a list of the values we want to assign for each condition
values = ['negative','negative','negative','positive','positive']

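# create a new column and use np.select to assign values to it using our lists as arguments
# (a string default is given because recent NumPy versions reject mixing string choices with the numeric default 0)
data_all['Sentiment_Human'] = np.select(conditions, values, default = 'unknown')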
data_all.to_json(r'C:\Users\unters1\Desktop\Projekt\NLP\Topic Analysis\Data\processed.json')
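
Since both sentiment columns split the stars at the same point (3 stars or fewer is negative), the mapping can also be written with a single threshold. This is only a sketch on a toy frame, assuming every rating is an integer from 1 to 5; missing ratings would need separate handling. The name `demo_ratings` is illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with one review per star rating (illustrative data)
demo_ratings = pd.DataFrame({'Rating': [1, 2, 3, 4, 5]})

# More than 3 stars counts as positive, everything else as negative
demo_ratings['Sentiment'] = np.where(demo_ratings['Rating'] > 3, 'positive', 'negative')
print(demo_ratings)
```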

In [39]:
da = pd.concat([data_all.loc[:, ['blob_rating', 'Review']].groupby("blob_rating").count(), data_all.loc[:, ['Rating', 'clean_text']].groupby("Rating").count()],  axis=1 ).rename(columns = {'Review':'Machine', 'clean_text':'Human'} )
da = da.reset_index().rename(columns = {'index':'Rating'} )

In [40]:
rt=["1 Star", "2 Stars","3 Stars", "4 Stars", "5 Stars"]

fig = go.Figure(data=[
    go.Bar(name='Human Rating', x=da['Human'], y=rt, orientation='h', texttemplate="%{x}", textposition="outside", textangle=0, textfont_color="gray", hoverinfo='skip'),
    go.Bar(name='TextBlob Rating', x=da['Machine'], y=rt, orientation='h', texttemplate="%{x}", textposition="outside", textangle=0, textfont_color="gray", hoverinfo='skip')
])
# Change the bar mode
fig.update_layout(
    font=dict(
        family="Calibri",
        size=15,),
    title={
        'text': "Comparison of Rating by Human and TextBlob",
        'y':0.9,},
    title_font_size= 24,
    xaxis=dict(
        showticklabels=False
    ),
    yaxis=dict(
        title='Rating'
    ),
    barmode='group',
    bargap=0.15, # gap between bars of adjacent location coordinates.
    bargroupgap=0.1 # gap between bars of the same location coordinate.
)
fig.show()

In [196]:
data_all['only_one'] = np.ones(len(data_all), dtype=np.int32)
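da = data_all.loc[:, ['blob_rating', 'Rating', 'only_one']].rename(columns = {'Rating':'Human Rating', 'blob_rating':'Rating TextBlob'}).pivot_table(values = 'only_one', index = 'Rating TextBlob', columns = 'Human Rating', aggfunc = 'sum', margins = True, margins_name = 'Total number in abs').astype(float)
# Turn the counts in each human-rating column into shares of that column's total
# (.loc replaces the chained assignment, which is not guaranteed to write back into da)
for i in range(1,6):
    da.loc[da.index[0:5], i] = da.loc[da.index[0:5], i]/da.loc['Total number in abs', i]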
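
A column-wise normalisation like this can also be obtained directly from `pd.crosstab` with `normalize='columns'`. The sketch below uses a small made-up frame (`demo_pairs` is illustrative, not the review data) and leaves out the totals row that the pivot table above keeps.

```python
import pandas as pd

# Made-up pairs of human rating and TextBlob rating (illustrative data)
demo_pairs = pd.DataFrame({
    'Rating':      [1, 1, 2, 2, 2, 3],
    'blob_rating': [1, 3, 2, 3, 3, 3],
})

# Share of each TextBlob rating within every human rating; each column sums to 1
demo_share = pd.crosstab(demo_pairs['blob_rating'], demo_pairs['Rating'], normalize='columns')
print(demo_share)
```

Reading down a column shows how TextBlob distributed the reviews that humans gave that star rating; the diagonal holds the share of exact matches.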

In [231]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list(da.reset_index().columns),
                line_color='white', fill_color='white',
                align='left'),
    cells=dict(values=da.reset_index().T,
               line_color='white',
               align='left',
               format = [None, [",.2%",",.2%",",.2%",",.2%",",.2%", ""], [",.2%",",.2%",",.2%",",.2%",",.2%", ""], [",.2%",",.2%",",.2%",",.2%",",.2%", ""], [",.2%",",.2%",",.2%",",.2%",",.2%", ""], [",.2%",",.2%",",.2%",",.2%",",.2%", ""], None],
               fill=dict(color=['white', ['lightgray','white','white','white','white','white'], ['white','lightgray','white','white','white','white'], ['white','white','lightgray','white','white','white'], ['white','white','white','lightgray','white','white'], ['white','white','white','white','lightgray','white'],'white'])
              ))
])
fig.update_layout(
    title = {
        'text' : 'Categorisation of Reviews from TextBlob relative to Human by stars'
        + '<br>' +  '<span style="font-size: 18px;"></span>',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()

In [222]:
da1 = data_all.loc[:, ['Sentiment_Blob', 'Sentiment_Human', 'only_one']].pivot_table(values = 'only_one', index = 'Sentiment_Blob', columns = 'Sentiment_Human', aggfunc = np.sum, margins = True)
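# Turn the counts in each human-sentiment column into shares of that column's total
# (.loc replaces the chained assignment, which is not guaranteed to write back into da1)
da1 = da1.astype(float)
for col in ['negative', 'positive']:
    da1.loc[da1.index[0:2], col] = da1.loc[da1.index[0:2], col]/da1.loc['All', col]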

In [232]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list(da1.reset_index().columns),
                line_color='white', fill_color='white',
                align='left'),
    cells=dict(values=da1.reset_index().T,
               line_color='white',
               align='left',
               format = [None, [",.2%",",.2%",None], [",.2%",",.2%",None], None],
               fill=dict(color=['white', ['lightgray','white','white'],['white','lightgray','white'],'white']),
              ))
])
fig.update_layout(
    title = {
        'text' : 'Categorisation of Reviews from TextBlob relative to Human by sentiment'
        + '<br>' +  '<span style="font-size: 18px;">Sentiment: negative <= 3 Stars, positive > 3 Stars </span>',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()

In [267]:
# Probability of the dominant topic, one series per dominant topic
hist_data = [data_all[data_all['top1'] == topics['Topic_name'][k]]['top1_prob'] for k in range(7)]
group_lables = ['Location', 'Disturbance', 'Food / Drinks', 'Room', 'Amenities', 'Good experience', 'Bad service']

In [268]:
fig = ff.create_distplot(hist_data, group_lables, show_hist = False)
fig.update_layout(
    font=dict(
        family="Calibri",
        size=15,),
    title={
        'text': "Probability of dominant topic",
        'y':0.9,},
    title_font_size= 24,
    xaxis=dict(
        title='Probability of Topic'
    ),
    yaxis=dict(
        title='Density'
    )
)
fig.show()

In [269]:
# Probability of the dominant topic, restricted to negative reviews (fewer than 3 stars)
negative = data_all[data_all['Rating'] < 3]
hist_data = [negative[negative['top1'] == topics['Topic_name'][k]]['top1_prob'] for k in range(7)]
group_lables = ['Location', 'Disturbance', 'Food / Drinks', 'Room', 'Amenities', 'Good experience', 'Bad service']
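
After filtering to negative reviews, a topic may be left with very few rows. The density curves are kernel density estimates, which as far as I know need at least two distinct values per group, so it may be safer to drop such groups before plotting. The helper `drop_unplottable` below is hypothetical and not part of the notebook's modules; the demo series are made up.

```python
import pandas as pd

def drop_unplottable(groups, labels, min_distinct=2):
    """Hypothetical helper: keep only groups with at least `min_distinct` distinct values."""
    kept = [(g, l) for g, l in zip(groups, labels) if pd.Series(g).nunique() >= min_distinct]
    kept_groups = [g for g, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_groups, kept_labels

# Illustrative groups: one usable, one with a single value, one empty
demo_groups = [pd.Series([0.4, 0.7, 0.9]), pd.Series([0.5]), pd.Series([], dtype=float)]
demo_labels = ['Room', 'Location', 'Food / Drinks']

plot_groups, plot_labels = drop_unplottable(demo_groups, demo_labels)
print(plot_labels)
```

In the notebook this would be applied to `hist_data` and `group_lables` before they are passed to `ff.create_distplot`.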

In [270]:
fig = ff.create_distplot(hist_data, group_lables, show_hist = False)
fig.update_layout(
    font=dict(
        family="Calibri",
        size=15,),
    title={
        'text': "Probability of dominant topic for negative ratings (<3)",
        'y':0.9,},
    title_font_size= 24,
    xaxis=dict(
        title='Probability of Topic'
    ),
    yaxis=dict(
        title='Density'
    )
)
fig.show()

In [276]:
df2 = data_all[data_all['Rating'] <= 2].loc[:, ['only_one', 'top1','top2']].rename(columns ={'top2':'Second topic'}).pivot_table(index='Second topic', columns='top1', values='only_one', aggfunc =np.sum, margins = True, margins_name = 'Total', fill_value = '-')

In [277]:
# Cell colours per table column: shade the diagonal of the 7 topic columns (8 rows including 'Total');
# the first (index) and last (total) columns stay white
col_dia = ['white'] + [['lightgray' if row == col else 'white' for row in range(8)] for col in range(7)] + ['white']

In [280]:
fig = go.Figure(data=[go.Table(
    header=dict(values=list(df2.reset_index().columns),
                line_color='white', fill_color='white',
                align='left'),
    cells=dict(values=df2.reset_index().T,
               line_color='white',
               align='left',
               fill=dict(color=col_dia),
              ))
])
fig.update_layout(
    title = {
        'text' : 'Categorisation dominant and second topic'
        + '<br>' +  '<span style="font-size: 18px;">for negative reviews (1 or 2 Stars)</span>',
        'y' : 0.9},
    title_font_size = 24,
    font=dict(
        family="Calibri",
        size=15,)
    )
fig.show()