# Research question 3 - topic detection STTM (GSDMM algorithm)

A topic needs to be detected for each sentence. Because the sentences are short, I will use a short text topic modeling (STTM) algorithm, GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture). It is similar to LDA (Latent Dirichlet Allocation) but assigns only one topic per sentence.
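
To make the one-topic-per-sentence idea concrete, here is a minimal pure-Python sketch of the cluster probability that collapsed Gibbs sampling for a Dirichlet multinomial mixture evaluates for a single document, as I understand it from the GSDMM literature. It is a toy illustration, not the implementation of the libraries used in this notebook; the function name `cluster_probabilities`, the two clusters, their counts, and the vocabulary size are all made up for the example.

```python
from collections import Counter

def cluster_probabilities(doc, doc_counts, word_counts, alpha, beta, vocab_size):
    """Return p(cluster | doc) for one document under a DMM (toy sketch).

    doc_counts[k]  : number of documents currently in cluster k
    word_counts[k] : Counter of word occurrences in cluster k
    Both are assumed to exclude the document being scored.
    """
    n_clusters = len(doc_counts)
    n_other_docs = sum(doc_counts)
    scores = []
    for k in range(n_clusters):
        # Popularity term: clusters with more documents are more likely.
        score = (doc_counts[k] + alpha) / (n_other_docs + n_clusters * alpha)
        # Similarity term: words already frequent in the cluster raise the score.
        seen = Counter()
        for word in doc:
            score *= word_counts[k][word] + beta + seen[word]
            seen[word] += 1
        cluster_size = sum(word_counts[k].values())
        for i in range(len(doc)):
            score /= cluster_size + vocab_size * beta + i
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical state: cluster 0 is about transport, cluster 1 about amenities.
word_counts = [
    Counter({"station": 5, "walk": 4, "bus": 3}),
    Counter({"kitchen": 6, "coffee": 4, "towel": 2}),
]
doc_counts = [4, 4]

probs = cluster_probabilities(["walk", "station"], doc_counts, word_counts,
                              alpha=0.1, beta=0.1, vocab_size=6)
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs)
```

The whole sentence is scored against each cluster and receives a single label, whereas LDA would give every word its own topic assignment.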

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Styles" data-toc-modified-id="Styles-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Styles</a></span></li><li><span><a href="#Load-file" data-toc-modified-id="Load-file-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Load file</a></span></li><li><span><a href="#Remove-most-frequent-words" data-toc-modified-id="Remove-most-frequent-words-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Remove most frequent words</a></span></li><li><span><a href="#Sample-the-data-set" data-toc-modified-id="Sample-the-data-set-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Sample the data set</a></span></li><li><span><a href="#DMM-model" data-toc-modified-id="DMM-model-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>DMM model</a></span></li><li><span><a href="#GSDMM-model" data-toc-modified-id="GSDMM-model-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>GSDMM model</a></span></li><li><span><a href="#Most-frequent-words-in-each-topic" data-toc-modified-id="Most-frequent-words-in-each-topic-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Most frequent words in each topic</a></span></li><li><span><a href="#Assign-topic-name" data-toc-modified-id="Assign-topic-name-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Assign topic name</a></span></li><li><span><a href="#Assign-topic-name--and-probability-of-topic-to-each-sentence" data-toc-modified-id="Assign-topic-name--and-probability-of-topic-to-each-sentence-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Assign topic name  and probability of topic to each sentence</a></span></li><li><span><a href="#Plot-topics" 
data-toc-modified-id="Plot-topics-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Plot topics</a></span></li><li><span><a href="#Merge-sentence-topic-with-data-frame" data-toc-modified-id="Merge-sentence-topic-with-data-frame-1.12"><span class="toc-item-num">1.12&nbsp;&nbsp;</span>Merge sentence topic with data frame</a></span></li></ul></li></ul></div>

## Setup

### Imports

In [1]:
import numpy as np
import pandas as pd

import pickle
import operator
from tqdm import tqdm
from multiprocessing import Pool
import multiprocessing
from functools import partial


from gensim.utils import simple_preprocess
from nltk import FreqDist

from gsdmm.gsdmm import MovieGroupProcess
from GPyM_TM import GSDMM

# Plotting tools
import pyLDAvis
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.simplefilter('ignore', category=FutureWarning)


### Styles

In [2]:
def set_plot_styles(styles):
    mpl.rcParams.update(mpl.rcParamsDefault)
    plt.style.use(styles)
    
set_plot_styles(['mplstyle.config'])
color = sns.color_palette('tab20')

### Load file

In [3]:
data = pd.read_pickle('data_preprocessed_verbs.pkl')

### Remove most frequent words

In [4]:
def freq_words(content):
    all_words = [word for sentences in content for sentence in sentences for word in simple_preprocess(str(sentence), deacc=True)]
    freq_dist = FreqDist(all_words)
    return freq_dist

In [5]:
words_freq = freq_words(data.data_lemmatized_freq.tolist())

In [6]:
most_frequent_words = set([word for word, v in words_freq.most_common(5)])
most_frequent_words

{'apartment', 'host', 'location', 'place', 'stay'}

In [7]:
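def remove_most_freq_words(tokens):
    # `tokens` is one sentence given as a list of lemmas; split() guards against
    # multi-word entries before the most frequent words are filtered out.
    return [word for token in tokens for word in token.split() if word not in most_frequent_words]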
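
The counting and filtering steps above can be tried on a toy corpus. This is only a sketch: `collections.Counter` stands in for `nltk.FreqDist`, and the reviews, the cut-off of two words, and the variable names are invented for the illustration.

```python
from collections import Counter

# Toy corpus: reviews -> sentences -> lemmas (hypothetical data).
reviews = [
    [["place", "stay", "kitchen"], ["host", "place", "stay"]],
    [["place", "walk", "station"], ["stay", "place"]],
]

# Count every lemma across all sentences of all reviews.
counts = Counter(word for review in reviews for sentence in review for word in sentence)
top_words = {word for word, _ in counts.most_common(2)}

# Drop the most frequent lemmas while keeping the review/sentence nesting.
filtered = [[[w for w in sentence if w not in top_words] for sentence in review]
            for review in reviews]
print(top_words)
print(filtered)
```

Sentences that consist only of frequent words become empty lists, so any later step has to cope with empty sentences.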

In [8]:
data['data_lemmatized_no_freq'] = data.data_lemmatized_freq.progress_apply(lambda review: [remove_most_freq_words(sentence) 
                                                                                     for sentence in review])





### Sample the data set

In [4]:
def stratified_sample_df(df, col, n_samples):
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    df_.index = df_.index.droplevel(0)
    return df_
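
The same balancing idea can be sketched without pandas. This is a standard-library stand-in for the function above, not a replacement for it; `balanced_sample`, the `seed` argument, and the toy rows are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, key, n_samples, seed=0):
    """Draw the same number of rows from every group (pure-Python stand-in)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Cap the per-group size at the smallest group, as the pandas version does.
    n = min(n_samples, min(len(group) for group in groups.values()))
    rng = random.Random(seed)
    return [row for group in groups.values() for row in rng.sample(group, n)]

# Hypothetical data: five rows of one type, three of the other.
rows = ([{"type": "Western", "id": i} for i in range(5)]
        + [{"type": "Non-Western", "id": i} for i in range(3)])

sample = balanced_sample(rows, "type", n_samples=4)
per_group = Counter(row["type"] for row in sample)
print(per_group)
```

Because the per-group size is capped by the smallest group, the result can contain fewer rows per group than `n_samples` asks for.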

In [5]:
data_sample = stratified_sample_df(data, 'type', 1000000)
print(data_sample.shape)
data_sample.head()

(1780080, 29)


Unnamed: 0,id,date,comments,host_id,neighbourhood_cleansed,city,latitude,longitude,number_of_reviews,first_review,...,year,sentiment_from_rating,sentiment_reviews,sentiment_reviews_textblob,comments_to_sentences,sentiment_sentences,type,tokens,data_lemmatized,data_lemmatized_freq
3139507,1133331,2014-07-24,"The apartment was great! amazing location, spa...",4141225,Copacabana,Rio de Janeiro,-22.98605,-43.18885,61,2013-08-14,...,2014,pos,1,1,"[The apartment was great., amazing location sp...","[1, 1, 0]",Non-Western,"[[the, apartment, was, great], [amazing, locat...","[[apartment], [location, furnish], []]","[[apartment], [location, furnish], []]"
2688253,36666,2011-01-19,"Great, great place to stay on Caye Caulker. Ve...",157752,Belize Islands,Belize,17.74786,-88.02398,246,2010-11-27,...,2011,pos,1,1,"[Great great place to stay on Caye Caulker., V...","[1, 1, 1, 1, 0]",Non-Western,"[[great, great, place, to, stay, on, caye, cau...","[[place, stay], [lay, lot, thing, view, sunset...","[[place, stay], [lay, lot, thing, view, sunset..."
3462760,23830722,2018-06-25,Hiro was extremely hospitable and the place wa...,178741967,Komae Shi,Tokyo,35.62935,139.57184,92,2018-04-03,...,2018,pos,1,1,[Hiro was extremely hospitable and the place w...,"[1, 1]",Non-Western,"[[hiro, was, extremely, hospitable, and, the, ...","[[place, value], [would, recommend, place]]","[[place, value], [would, recommend, place]]"
2735520,338752,2016-10-02,"Wonderful apartment, centrally located, Martin...",1714257,Retiro,Buenos Aires,-34.59563,-58.37487,106,2012-04-03,...,2016,pos,1,1,[Wonderful apartment centrally located Martin ...,[1],Non-Western,"[[wonderful, apartment, centrally, located, ma...","[[apartment, locate, host]]","[[apartment, locate, host]]"
2795383,17252135,2019-07-15,"Very spacious and clean, and Manuela was avail...",116263418,Palermo,Buenos Aires,-34.58734,-58.44043,79,2017-02-27,...,2019,pos,1,1,[Very spacious and clean and Manuela was avail...,"[0, 1]",Non-Western,"[[very, spacious, and, clean, and, manuela, wa...","[[question], [thank, experience]]","[[question], [thank, experience]]"


In [6]:
data_sample = [sentence for review in data_sample.data_lemmatized_freq.tolist() for sentence in review]

In [4]:
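# Flatten all reviews into one list of sentences (each sentence is a list of lemmas).
# This rebinds `data` from the DataFrame to a plain list.
data = [sentence for review in data.data_lemmatized_freq.tolist() for sentence in review]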

In [5]:
len(data)

14491096

### DMM model

In [13]:
corpus = data_sample

nTopics = 10

data_dmm = GSDMM.DMM(corpus, nTopics, iters=15) # Initialize the model with 10 topics and 15 iterations; other parameters keep their defaults

data_dmm.topicAssigmentInitialise() # Performs the initial document assignments and counts
data_dmm.inference()

psi, theta, selected_psi, selected_theta = data_dmm.worddist() # Determines and stores psi, theta, selected_psi and selected_theta

finalAssignments = data_dmm.writeTopicAssignments() # Records the final topic assignments for the documents

coherence_topwords = data_dmm.writeTopTopicalWords(finalAssignments) # Records the top words of each topic

score = data_dmm.coherence(coherence_topwords, len(finalAssignments)) # Calculates and stores the coherence

print('Final number of topics found: ' + str(len(finalAssignments)))

corpus=2499313, words=28883, K=10, a=0.100000, b=0.100000, nTopWords=10, iters=15
iteration: 0
iteration: 1
iteration: 2
iteration: 3
iteration: 4
iteration: 5
iteration: 6
iteration: 7
iteration: 8
iteration: 9
iteration: 10
iteration: 11
iteration: 12
iteration: 13
iteration: 14
[0 1 2 3 4 5 6 7 8 9]
area tip lot city time recommendation neighborhood thing information restaurant 
view room space city pool balcony area house day building 
time home experience thank trip communication hospitality family day house 
restaurant shop distance bar store lot area station neighborhood food 
room bed bathroom space kitchen shower bedroom water living area 
room space value family home people price time house money 
night noise street building room floor stair door parking day 
check communication time question response day arrival instruction message issue 
kitchen coffee breakfast water room machine towel touch tea amenity 
station minute bus subway walk train city line area tube 
average top

### GSDMM model

In [10]:
K = 10
mgp = MovieGroupProcess(K=K, alpha=0.1, beta=0.1, n_iters=30)
docs = data
# fit() needs the vocabulary size
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
# y holds the cluster label assigned to each document
y = mgp.fit(docs, n_terms)

In stage 0: transferred 13020946 clusters with 10 clusters populated
In stage 1: transferred 12921708 clusters with 10 clusters populated
In stage 2: transferred 11867241 clusters with 10 clusters populated
In stage 3: transferred 9220580 clusters with 10 clusters populated
In stage 4: transferred 7435674 clusters with 10 clusters populated
In stage 5: transferred 6608629 clusters with 10 clusters populated
In stage 6: transferred 6184150 clusters with 10 clusters populated
In stage 7: transferred 5957327 clusters with 10 clusters populated
In stage 8: transferred 5823214 clusters with 10 clusters populated
In stage 9: transferred 5741733 clusters with 10 clusters populated
In stage 10: transferred 5690476 clusters with 10 clusters populated
In stage 11: transferred 5658635 clusters with 10 clusters populated
In stage 12: transferred 5632365 clusters with 10 clusters populated
In stage 13: transferred 5615232 clusters with 10 clusters populated
In stage 14: transferred 5600965 clusters

In [5]:
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)

fractions = doc_count * 100. / doc_count.sum()
np.set_printoptions(precision=2)
print('% of documents per topic:', fractions)
print('*'*20)

# Topics sorted by the number of documents assigned to them
top_index = doc_count.argsort()[::-1]
print('Most important topics (by number of docs inside):', top_index)
print('*'*20)

Number of documents per topic : [1482554 1011952 1956553 1141331 1105486 2188387 1550487  525709 2806511
  722126]
********************
% of documents per topic: [10.23  6.98 13.5   7.88  7.63 15.1  10.7   3.63 19.37  4.98]
********************
Most important topics (by number of docs inside): [8 5 2 6 0 3 4 1 9 7]
********************


### Most frequent words in each topic

In [6]:
def top_words(cluster_word_distribution, top_index, num_words):
    # Print the num_words most frequent words of each topic, in the order given by top_index
    for index in top_index:
        print('Topic {} '.format(index))
        print(sorted(cluster_word_distribution[index].items(), key=operator.itemgetter(1), reverse=True)[:num_words])
        print('*'*20)

In [7]:
top_words(mgp.cluster_word_distribution, top_index, 20)

Topic 8 
[('stay', 1295034), ('would', 719143), ('place', 688079), ('recommend', 675898), ('time', 192031), ('apartment', 172922), ('enjoy', 143074), ('come', 136567), ('love', 127938), ('visit', 123150), ('host', 100237), ('thank', 92605), ('look', 79946), ('experience', 77063), ('book', 74497), ('family', 65854), ('night', 64889), ('location', 61151), ('friend', 59024), ('return', 56814)]
********************
Topic 5 
[('walk', 548638), ('location', 493315), ('station', 378442), ('restaurant', 345304), ('place', 253917), ('minute', 239630), ('subway', 202487), ('apartment', 183058), ('shop', 182013), ('distance', 170358), ('bus', 156759), ('locate', 155766), ('train', 152734), ('area', 135514), ('tube', 116906), ('neighborhood', 111113), ('lot', 107708), ('bar', 103739), ('city', 99053), ('store', 95029)]
********************
Topic 2 
[('location', 557386), ('place', 384577), ('apartment', 325618), ('stay', 200799), ('host', 198340), ('need', 131626), ('value', 97680), ('room', 96087

### Assign topic name

In [8]:
topic_dict = {}
topic_names = ['Experience',
               'Location',
               'Value for money',
               'Hospitality',
               'Communication with host',
               'Description accuracy',
               'Property - inside',
               'Property - surroundings',
               'Host advices',
               'Facilities']
for i, topic_num in enumerate(top_index):
    topic_dict[topic_num]=topic_names[i] 

In [9]:
topic_dict

{8: 'Experience',
 5: 'Location',
 2: 'Value for money',
 6: 'Hospitality',
 0: 'Communication with host',
 3: 'Description accuracy',
 4: 'Property - inside',
 1: 'Property - surroundings',
 9: 'Host advices',
 7: 'Facilities'}
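
The mapping above ties each name to a cluster's rank by size, not to the cluster id itself. A small sketch with invented counts and placeholder labels shows the mechanics; `doc_counts`, `names_by_rank`, and `names_by_cluster` are hypothetical names.

```python
# Hypothetical documents per cluster, indexed by cluster id.
doc_counts = [120, 40, 300]
# Placeholder labels, listed from the largest cluster to the smallest.
names_by_rank = ["Largest", "Middle", "Smallest"]

# Cluster ids ordered by descending document count.
ranked = sorted(range(len(doc_counts)), key=lambda k: doc_counts[k], reverse=True)
names_by_cluster = dict(zip(ranked, names_by_rank))
print(names_by_cluster)
```

If the model is refitted, cluster ids and sizes can change, so the names have to be checked against the top words again before they are reused.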

In [11]:
with open('model_mgp_all_freq_all', 'wb') as f:
    pickle.dump(mgp, f)

In [4]:
import pickle
with open('model_mgp_all_freq_all', 'rb') as f:
    mgp = pickle.load(f)

### Assign topic name  and probability of topic to each sentence

In [20]:
# Sequential version, kept for reference; the parallel version below supersedes it.
def create_topics_dataframe(data_text=data, mgp=mgp, threshold=0.4, topic_dict=topic_dict):
    result = pd.DataFrame(columns=['text', 'topic', 'topic_prob'])
    with tqdm(total=len(data_text)) as pbar:
        for i, text in enumerate(data_text):
            result.at[i, 'text'] = text
            # choose_best_label returns (most likely cluster, its probability)
            prob = mgp.choose_best_label(data_text[i])
            if prob[1] >= threshold:
                result.at[i, 'topic'] = topic_dict[prob[0]]
                result.at[i, 'topic_prob'] = prob[1]
            else:
                if len(text) != 0:
                    result.at[i, 'topic'] = 'Other'
                    result.at[i, 'topic_prob'] = prob[1]
                else:
                    result.at[i, 'topic'] = []
                    result.at[i, 'topic_prob'] = None
            pbar.update(1)
        return result

In [10]:
def assign_topic(itext, threshold, topic_dict):
    # itext is an (index, tokens) pair produced by enumerate()
    i, text = itext
    # choose_best_label returns (most likely cluster, its probability)
    label, prob = mgp.choose_best_label(text)

    if prob >= threshold:
        return [i, text, topic_dict[label], prob]
    if len(text) != 0:
        # Below the threshold: keep the probability but label the sentence 'Other'
        return [i, text, 'Other', prob]
    # Empty sentence: no topic and no probability
    return [i, text, [], None]

def create_topics_dataframe(data_text=data, mgp=mgp, threshold=0, topic_dict=topic_dict):
    # assign_topic reads the module-level mgp; the mgp argument is kept only for the signature
    assign_topic_for_text = partial(assign_topic, threshold=threshold, topic_dict=topic_dict)

    # Use all cores but one
    with Pool(multiprocessing.cpu_count() - 1) as pool:
        processed_data = list(tqdm(pool.imap(assign_topic_for_text, enumerate(data_text)), total=len(data_text)))
    # Restore the original sentence order before building the data frame
    result_data = sorted(processed_data, key=lambda row: row[0])
    result = pd.DataFrame(result_data, columns=['i', 'text', 'topic', 'topic_prob'])
    return result.drop(columns=['i'])
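
The thresholding rule in `assign_topic` can be isolated in a few lines. The sketch below works on a plain list of probabilities instead of a fitted model; `label_with_threshold`, the topic names, and the probabilities are assumptions for the illustration.

```python
def label_with_threshold(probabilities, names, threshold):
    """Pick the most probable topic, or 'Other' if it is not confident enough."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    prob = probabilities[best]
    topic = names[best] if prob >= threshold else "Other"
    return topic, prob

names = ["Location", "Facilities", "Experience"]

# A clear winner keeps its topic name.
confident = label_with_threshold([0.85, 0.10, 0.05], names, threshold=0.4)
# A flat distribution falls below the threshold and is labelled 'Other'.
uncertain = label_with_threshold([0.38, 0.32, 0.30], names, threshold=0.4)
print(confident, uncertain)
```

With `threshold=0`, the default used in `create_topics_dataframe` above, the best label is always kept, so filtering by confidence has to happen later on `topic_prob`.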

In [11]:
data_lemmatized_list = [sentence for review in data.data_lemmatized_freq.tolist() for sentence in review]

In [12]:
gsdmm_output_freq = create_topics_dataframe(data_lemmatized_list)

100%|██████████| 14491096/14491096 [46:38<00:00, 5177.64it/s] 


In [13]:
gsdmm_output_freq[gsdmm_output_freq.topic_prob>0.9].sort_values(by='topic_prob', ascending=False)

| | text | topic | topic_prob |
|---|---|---|---|
| 4040023 | [make, breakfast, morning, abundance, variety,... | Facilities | 1.000000 |
| 9650360 | [arrival, fridge, milk, juice, bread, egg, yog... | Facilities | 1.000000 |
| 14335452 | [location, apartment, walk, distance, tawarama... | Location | 1.000000 |
| 4063736 | [supply, basic, bread, butter, tea, coffee, mi... | Facilities | 1.000000 |
| 10204682 | [load, fridge, egg, fruit, bagel, yogurt, juic... | Facilities | 1.000000 |
| ... | ... | ... | ... |
| 2126982 | [location, street, angel] | Location | 0.900001 |
| 50742 | [location, street, angel] | Location | 0.900001 |
| 1016519 | [location, street, angel] | Location | 0.900001 |
| 6865942 | [pro, location, village] | Value for money | 0.900000 |


In [15]:
print(gsdmm_output_freq[gsdmm_output_freq.topic=='Experience'].sort_values(by='topic_prob', ascending=False)['text'].tolist()[:100])

[['would', 'hesitate', 'recommend', 'evas', 'place', 'would', 'stay', 'heartbeat', 'come'], ['would', 'refer', 'colleague', 'future', 'look', 'place', 'stay', 'would', 'come', 'future', 'need', 'place', 'stay'], ['would', 'hesitation', 'recommend', 'property', 'would', 'enjoy', 'return', 'visit', 'future'], ['house', 'enjoy', 'stay', 'would', 'recommend', 'look', 'place', 'stay', 'would', 'plan', 'come', 'future'], ['would', 'hesitation', 'recommend', 'flat', 'melbourne', 'future', 'would', 'consider', 'stay', 'flat'], ['would', 'hesitate', 'return', 'niko', 'future', 'would', 'recommend', 'look', 'experience', 'price'], ['would', 'hope', 'return', 'future', 'would', 'hesitate', 'recommend', 'place'], ['would', 'recommend', 'stay', 'simone', 'heartbeat', 'return', 'future', 'look', 'stay'], ['would', 'hesitation', 'return', 'stay', 'future', 'recommend', 'accommodation', 'consider', 'visit'], ['would', 'hesitate', 'recommend', 'listing', 'would', 'stay', 'apartment', 'return', 'future'

In [16]:
with open('gsdmm_output_all_freq_all', 'wb') as f:
    pickle.dump(gsdmm_output_freq, f)

In [4]:
with open('gsdmm_output_all_freq_all', 'rb') as f:
    gsdmm_output_freq = pickle.load(f)

### Plot topics

In [17]:
import pandas as pd
import pyLDAvis
import math

docs = [sentence for review in data.data_lemmatized_freq.tolist() for sentence in review]
vocab = set(x for doc in docs for x in doc)

def prepare_data(mgp):
    vocabulary = list(vocab)
    # Topic distribution of every document, as scored by the fitted model
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    # Corpus-wide frequency of each term
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    # Replace NaN scores with 1/K and all-zero rows with a uniform distribution over the K topics
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)

    # No row may sum to less than 0.999
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    # Topic-term matrix: the word counts of each cluster normalized to probabilities
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum(occurrence for word, occurrence in cluster.items())
        assert not math.isnan(total)
        if total == 0:
            # Empty cluster: fall back to a uniform distribution over the vocabulary
            row = [(1 / len(vocabulary))] * len(vocabulary)
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    return vis_data

vis_data = prepare_visualization_data(mgp)

%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)
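
The row normalization inside `prepare_data`, including the uniform fallback for an empty cluster, can be checked on a toy vocabulary. This is a sketch only; `topic_term_row`, the vocabulary, and the counts are invented for the example.

```python
def topic_term_row(cluster_counts, vocabulary):
    """Turn one cluster's word counts into a probability row over the vocabulary."""
    total = sum(cluster_counts.values())
    if total == 0:
        # Empty cluster: use a uniform row so that it still sums to 1.
        return [1 / len(vocabulary)] * len(vocabulary)
    return [cluster_counts.get(term, 0) / total for term in vocabulary]

vocabulary = ["bus", "coffee", "station", "towel"]
rows = [
    topic_term_row({"bus": 1, "station": 3}, vocabulary),  # populated cluster
    topic_term_row({}, vocabulary),                        # empty cluster
]
print(rows)
```

Each row sums to 1, which a topic-term matrix needs in order to be read as one probability distribution per topic.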



### Merge sentence topic with data frame 

In [17]:
data['sentence_count'] = data.tokens.apply(len)

In [18]:
data['first_sentence_index'] = data['sentence_count'].shift().cumsum().fillna(0).astype(int)
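
The index arithmetic can be verified by hand with the standard library. In this sketch the sentence counts are those of the first five reviews shown in the `data.head()` output further down, while the flat list of labels is a placeholder; it assumes the flat list contains exactly one entry per counted sentence, in review order.

```python
from itertools import accumulate

# Sentences per review, taken from the first five rows of the data frame.
sentence_counts = [10, 3, 7, 4, 2]
# Placeholder labels standing in for the flat list of sentence topics.
flat_topics = [f"t{i}" for i in range(sum(sentence_counts))]

# Exclusive prefix sum: position of each review's first sentence in the flat list.
first_index = [0] + list(accumulate(sentence_counts))[:-1]

# Slice the flat list back into one list of labels per review.
per_review = [flat_topics[start:start + n]
              for start, n in zip(first_index, sentence_counts)]
print(first_index)
print(per_review[1])
```

The shifted cumulative sum in the cell above yields the same start offsets, and the slices recover one list of labels per review.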

In [19]:
sentence_topics = gsdmm_output_freq.topic.tolist()

In [20]:
data['sentence_topic'] = data[['first_sentence_index', 'sentence_count']]\
    .progress_apply(lambda row: sentence_topics[row['first_sentence_index'] : (row['first_sentence_index'] + row['sentence_count'])], axis = 1)





In [21]:
sentence_topics_prob = gsdmm_output_freq.topic_prob.tolist()

In [22]:
data['sentence_topic_prob'] = data[['first_sentence_index', 'sentence_count']]\
    .progress_apply(lambda row: sentence_topics_prob[row['first_sentence_index'] : (row['first_sentence_index'] + row['sentence_count'])], axis = 1)





In [23]:
data.head()

Unnamed: 0,id,date,comments,host_id,neighbourhood_cleansed,city,latitude,longitude,number_of_reviews,first_review,...,comments_to_sentences,sentiment_sentences,type,tokens,data_lemmatized,data_lemmatized_freq,sentence_count,first_sentence_index,sentence_topic,sentence_topic_prob
0,13913,2010-08-18,My girlfriend and I hadn't known Alina before ...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,[My girlfriend and I had not known Alina befor...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 1]",Western,"[[my, girlfriend, and, had, not, known, alina,...","[[girlfriend, know, take, leap, faith, rent], ...","[[girlfriend, know, take, leap, faith, rent], ...",10,0,"[Hospitality, Value for money, Value for money...","[0.7815883986461568, 0.19150870152390334, 0.30..."
1,13913,2011-07-11,Alina was a really good host. The flat is clea...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[Alina was a really good host., The flat is cl...","[0, 0, 0]",Western,"[[alina, was, really, good, host], [the, flat,...","[[host], [finsbury, park, station], [recommend]]","[[host], [finsbury, park, station], [recommend]]",3,10,"[Hospitality, Location, Experience]","[0.31272632303744097, 0.999029223577597, 0.876..."
2,13913,2011-09-13,Alina is an amazing host. She made me feel rig...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[Alina is an amazing host., She made me feel r...","[1, 0, 1, 1, 0, 0, 1]",Western,"[[alina, is, an, amazing, host], [she, made, m...","[[host], [make, feel, home], [hang, friend, st...","[[host], [make, feel, home], [hang, friend, st...",7,13,"[Hospitality, Hospitality, Hospitality, Facili...","[0.31272632303744097, 0.9929232579055125, 0.72..."
3,13913,2011-10-03,"Alina's place is so nice, the room is big and ...",54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,[Alina s place is so nice the room is big and ...,"[1, 1, 0, 1]",Western,"[[alina, place, is, so, nice, the, room, is, b...","[[room, bed], [host, make, need, instance, put...","[[room, bed], [host, make, need, instance, put...",4,20,"[Property - inside, Facilities, Host advices, ...","[0.7272363775074547, 0.5388052616205787, 0.986..."
4,13913,2011-10-09,"Nice location in Islington area, good for shor...",54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,[Nice location in Islington area good for shor...,"[1, 1]",Western,"[[nice, location, in, islington, area, good, f...","[[location, area, business, trip], [host]]","[[location, area, business, trip], [host]]",2,24,"[Value for money, Hospitality]","[0.8053933270336349, 0.31272632303744097]"


In [24]:
data.comments_to_sentences[5]

['I am very happy to have been Alina s guest.',
 'we have had great time in London and enjoyed our stay.',
 'Alina is a great host we felt us so welcomed by her.',
 'Alina s house location is very convenient it is only min walk to Finsbury Park tube station and also a direct Picadilly line to Heathrow Airport in case yu have an early departure you can use the opportunity to sleep a bit in the train.',
 'The flat itself is very nice and clean and comfortable especially the double-bed with new mattress I slept like a newborn And also the red sofa on the small roof terrace is great I enjoyed the last night London sky.',
 'To all who is going to visit London I highly reccomend Alina and her beautiful house to stay in.',
 'Alina thank you so much and I hope to see you one day again']

In [25]:
data.sentence_topic[5]

['Hospitality',
 'Experience',
 'Hospitality',
 'Location',
 'Property - inside',
 'Experience',
 'Experience']

In [26]:
with open('data_with_topics_all_freq_all', 'wb') as f:
    pickle.dump(data, f)