# Research question 3 - topic detection with short text topic modeling (STTM, GSDMM algorithm)

A topic needs to be detected for each sentence. Because the sentences are short, I will use the short text topic modeling algorithm GSDMM (Gibbs Sampling Dirichlet Multinomial Mixture), which is similar to LDA (Latent Dirichlet Allocation) but assigns only one topic per sentence.
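
To make the "one topic per sentence" idea concrete, here is a minimal sketch of how a Dirichlet Multinomial Mixture model can score a single tokenised sentence against each cluster. It follows the collapsed Gibbs sampling conditional of GSDMM as I understand it. The function name `cluster_probabilities`, its arguments, and the toy counts are assumptions made for illustration; this is not the code of the `gsdmm` package used below.

```python
import math
from collections import Counter

def cluster_probabilities(doc, doc_counts, word_counts, alpha, beta, vocab_size):
    """Return one normalised probability per cluster for a tokenised sentence.

    doc_counts[z]  : number of sentences currently assigned to cluster z
    word_counts[z] : dict mapping word -> occurrences of that word in cluster z
    """
    n_clusters = len(doc_counts)
    n_docs = sum(doc_counts)
    log_scores = []
    for z in range(n_clusters):
        n_z = sum(word_counts[z].values())
        # Popularity term: larger clusters are more likely a priori.
        log_p = math.log(doc_counts[z] + alpha) - math.log(n_docs + n_clusters * alpha)
        # Similarity term: words already frequent in the cluster raise the score.
        for word, freq in Counter(doc).items():
            for j in range(1, freq + 1):
                log_p += math.log(word_counts[z].get(word, 0) + beta + j - 1)
        for i in range(1, len(doc) + 1):
            log_p -= math.log(n_z + vocab_size * beta + i - 1)
        log_scores.append(log_p)
    # Normalise in log space to avoid underflow on longer sentences.
    max_log = max(log_scores)
    weights = [math.exp(s - max_log) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Toy state: cluster 0 is about transport, cluster 1 about amenities.
doc_counts = [4, 3]
word_counts = [
    {"station": 5, "walk": 4, "bus": 3},
    {"kitchen": 4, "coffee": 3, "towel": 2},
]
probs = cluster_probabilities(["walk", "station"], doc_counts, word_counts,
                              alpha=0.1, beta=0.1, vocab_size=6)
best_cluster = max(range(len(probs)), key=probs.__getitem__)
print(best_cluster, probs)
```

The sentence gets exactly one cluster (the arg max), and the probability of that cluster is what the threshold later in this notebook is compared against.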

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Styles" data-toc-modified-id="Styles-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Styles</a></span></li><li><span><a href="#Load-file" data-toc-modified-id="Load-file-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Load file</a></span></li><li><span><a href="#Remove-most-frequent-words" data-toc-modified-id="Remove-most-frequent-words-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Remove most frequent words</a></span></li><li><span><a href="#Sample-the-data-set" data-toc-modified-id="Sample-the-data-set-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Sample the data set</a></span></li><li><span><a href="#DMM-model" data-toc-modified-id="DMM-model-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>DMM model</a></span></li><li><span><a href="#GSDMM-model" data-toc-modified-id="GSDMM-model-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>GSDMM model</a></span></li><li><span><a href="#Most-frequent-words-in-each-topic" data-toc-modified-id="Most-frequent-words-in-each-topic-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Most frequent words in each topic</a></span></li><li><span><a href="#Assign-topic-name" data-toc-modified-id="Assign-topic-name-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Assign topic name</a></span></li><li><span><a href="#Assign-topic-name--and-probability-of-topic-to-each-sentence" data-toc-modified-id="Assign-topic-name--and-probability-of-topic-to-each-sentence-1.10"><span class="toc-item-num">1.10&nbsp;&nbsp;</span>Assign topic name  and probability of topic to each sentence</a></span></li><li><span><a href="#Plot-topics" 
data-toc-modified-id="Plot-topics-1.11"><span class="toc-item-num">1.11&nbsp;&nbsp;</span>Plot topics</a></span></li><li><span><a href="#Merge-sentence-topic-with-data-frame" data-toc-modified-id="Merge-sentence-topic-with-data-frame-1.12"><span class="toc-item-num">1.12&nbsp;&nbsp;</span>Merge sentence topic with data frame</a></span></li></ul></li></ul></div>

## Setup

### Imports

In [1]:
import numpy as np
import pandas as pd

import pickle
import operator
from tqdm import tqdm
from multiprocessing import Pool
import multiprocessing
from functools import partial


from gensim.utils import simple_preprocess
from nltk import FreqDist

from gsdmm.gsdmm import MovieGroupProcess
from GPyM_TM import GSDMM

# Plotting tools
import pyLDAvis
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)
warnings.simplefilter('ignore', category=FutureWarning)


### Styles

In [2]:
def set_plot_styles(styles):
    mpl.rcParams.update(mpl.rcParamsDefault)
    plt.style.use(styles)
    
set_plot_styles(['mplstyle.config'])
color = sns.color_palette('tab20')

### Load file

In [3]:
data = pd.read_pickle('data_preprocessed_verbs.pkl')

### Remove most frequent words

In [4]:
def freq_words(content):
    all_words = [word for sentences in content for sentence in sentences for word in simple_preprocess(str(sentence), deacc=True)]
    freq_dist = FreqDist(all_words)
    return freq_dist

In [5]:
words_freq = freq_words(data.data_lemmatized_freq.tolist())

In [6]:
most_frequent_words = set([word for word, v in words_freq.most_common(5)])
most_frequent_words

{'apartment', 'host', 'location', 'place', 'stay'}

In [7]:
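def remove_most_freq_words(sentence):
    # `sentence` is a list of lemmatized tokens; keep only the words that are not among the most frequent ones.
    return [word for token in sentence for word in token.split() if word not in most_frequent_words]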

In [8]:
data['data_lemmatized_no_freq'] = data.data_lemmatized_freq.progress_apply(lambda review: [remove_most_freq_words(sentence) 
                                                                                     for sentence in review])
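
The filtering step above can be sketched on a toy corpus. This version swaps in `collections.Counter` for `FreqDist` and skips `simple_preprocess`, so it is an illustration of the idea under those assumptions rather than the exact pipeline; the toy `reviews` data is made up.

```python
from collections import Counter

# Each review is a list of sentences; each sentence is a list of lemmatized words.
reviews = [
    [["place", "clean", "kitchen"], ["host", "place"]],
    [["place", "station", "walk"], ["host"]],
]

# Count every word in the corpus and pick the two most frequent ones.
counts = Counter(word for review in reviews for sentence in review for word in sentence)
most_frequent = {word for word, _ in counts.most_common(2)}

# Drop those words but keep the sentence structure, including sentences that become empty.
filtered = [
    [[word for word in sentence if word not in most_frequent] for sentence in review]
    for review in reviews
]
print(most_frequent)
print(filtered)
```

Sentences that consist only of frequent words become empty lists instead of disappearing, so the number of sentences per review stays the same. The merge step at the end of this notebook relies on that.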


### Sample the data set

In [14]:
def stratified_sample_df(df, col, n_samples):
    n = min(n_samples, df[col].value_counts().min())
    df_ = df.groupby(col).apply(lambda x: x.sample(n))
    df_.index = df_.index.droplevel(0)
    return df_

In [15]:
data_sample = stratified_sample_df(data, 'type', 1000000)
print(data_sample.shape)
data_sample.head()

(1780080, 30)


Unnamed: 0,id,date,comments,host_id,neighbourhood_cleansed,city,latitude,longitude,number_of_reviews,first_review,...,sentiment_from_rating,sentiment_reviews,sentiment_reviews_textblob,comments_to_sentences,sentiment_sentences,type,tokens,data_lemmatized,data_lemmatized_freq,data_lemmatized_no_freq
3420615,16431483,2020-01-08,The location is a 5mins walk to Oshiage Skytre...,42876350,Sumida Ku,Tokyo,35.71063,139.80844,269,2017-01-14,...,pos,0,1,[The location is a mins walk to Oshiage Skytre...,"[0, 0, 0, 0, 0, 0, 0]",Non-Western,"[[the, location, is, mins, walk, to, oshiage, ...","[[location, min, walk, oshiage, skytree, shop,...","[[location, min, walk, oshiage, skytree, shop,...","[[min, walk, oshiage, skytree, shop, mall, lot..."
2784917,13140800,2017-02-12,"This apartment is located near restaurants, ba...",73244585,Recoleta,Buenos Aires,-34.59463,-58.41043,39,2016-06-08,...,pos,1,1,[This apartment is located near restaurants ba...,"[0, 1, 0, 0, 0, 0]",Non-Western,"[[this, apartment, is, located, near, restaura...","[[apartment, locate, restaurant, bar, grocery,...","[[apartment, locate, restaurant, bar, grocery,...","[[locate, restaurant, bar, grocery, store, att..."
2903898,20716506,2019-01-03,"The place is spacious, clean, has bath tub, ai...",146408355,Islands,Hong Kong,22.24041,113.97741,7,2017-12-13,...,pos,1,1,[The place is spacious clean has bath tub airc...,"[0, 0, 0, 0, 0]",Non-Western,"[[the, place, is, spacious, clean, has, bath, ...","[[place, bath, need, watch, internet, subscrib...","[[place, bath, need, watch, internet, subscrib...","[[bath, need, watch, internet, subscriber, wat..."
2837425,163742,2017-01-28,it's good house. but it is third floor. That m...,304876,Central & Western,Hong Kong,22.28694,114.14855,224,2011-08-11,...,pos,1,1,"[it is good house., but it is third floor., Th...","[0, 0, 0, 0, 1, 0]",Non-Western,"[[it, is, good, house], [but, it, is, third, f...","[[house], [floor], [mean, lift, luggage], [wat...","[[house], [floor], [mean, lift, luggage], [wat...","[[house], [floor], [mean, lift, luggage], [wat..."
2750930,1777472,2017-04-15,The place is really nice! Great location. Clos...,9332066,Recoleta,Buenos Aires,-34.59053,-58.39445,80,2013-12-28,...,pos,1,1,"[The place is really nice., Great location., C...","[0, 1, 0, 1]",Non-Western,"[[the, place, is, really, nice], [great, locat...","[[place], [location], [subway, tourist, attrac...","[[place], [location], [subway, tourist, attrac...","[[], [], [subway, tourist, attraction], [condi..."
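
The sample has 1,780,080 rows even though 1,000,000 rows per class were requested, because `stratified_sample_df` caps the per-class size at the size of the smallest class. If `type` has two classes, as the rows shown suggest, that corresponds to 890,040 reviews per class. A plain-Python sketch of the same capping logic, with a hypothetical helper `stratified_sample` and made-up rows, could look like this:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, n_samples, seed=0):
    """Draw the same number of rows from every group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    # Never ask for more rows than the smallest group can provide.
    n = min(n_samples, min(len(group) for group in groups.values()))
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], n))
    return sample

rows = [("Western", i) for i in range(7)] + [("Non-Western", i) for i in range(3)]
sample = stratified_sample(rows, key=lambda row: row[0], n_samples=5)
counts = {label: sum(1 for row in sample if row[0] == label)
          for label in ("Western", "Non-Western")}
print(counts)  # {'Western': 3, 'Non-Western': 3}
```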


In [16]:
data_sample = [sentence for review in data_sample.data_lemmatized_no_freq.tolist() for sentence in review]

### DMM model

In [13]:
corpus = data_sample

nTopics=10

data_dmm = GSDMM.DMM(corpus, nTopics, iters=15) # Initialize the model; all other parameters keep their defaults.

data_dmm.topicAssigmentInitialise() # Performs the initial document assignments and counts
data_dmm.inference()

psi, theta, selected_psi, selected_theta = data_dmm.worddist() # Determines and stores the psi, theta, selected_psi and selected_theta values
   
finalAssignments = data_dmm.writeTopicAssignments() # Records the final topic assignments for the documents

coherence_topwords = data_dmm.writeTopTopicalWords(finalAssignments) # Records the top words for each topic

score = data_dmm.coherence(coherence_topwords, len(finalAssignments)) # Calculates and stores the coherence

print('Final number of topics found: ' + str(len(finalAssignments)))

corpus=2499313, words=28883, K=10, a=0.100000, b=0.100000, nTopWords=10, iters=15
iteration: 0
iteration: 1
iteration: 2
iteration: 3
iteration: 4
iteration: 5
iteration: 6
iteration: 7
iteration: 8
iteration: 9
iteration: 10
iteration: 11
iteration: 12
iteration: 13
iteration: 14
[0 1 2 3 4 5 6 7 8 9]
area tip lot city time recommendation neighborhood thing information restaurant 
view room space city pool balcony area house day building 
time home experience thank trip communication hospitality family day house 
restaurant shop distance bar store lot area station neighborhood food 
room bed bathroom space kitchen shower bedroom water living area 
room space value family home people price time house money 
night noise street building room floor stair door parking day 
check communication time question response day arrival instruction message issue 
kitchen coffee breakfast water room machine towel touch tea amenity 
station minute bus subway walk train city line area tube 
average top

### GSDMM model

In [17]:
K = 10
mgp = MovieGroupProcess(K=K, alpha=0.1, beta=0.1, n_iters=30)
docs = data_sample
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
y = mgp.fit(docs, n_terms)

In stage 0: transferred 6663137 clusters with 10 clusters populated
In stage 1: transferred 6592793 clusters with 10 clusters populated
In stage 2: transferred 6014619 clusters with 10 clusters populated
In stage 3: transferred 4894830 clusters with 10 clusters populated
In stage 4: transferred 4156553 clusters with 10 clusters populated
In stage 5: transferred 3736841 clusters with 10 clusters populated
In stage 6: transferred 3493261 clusters with 10 clusters populated
In stage 7: transferred 3385763 clusters with 10 clusters populated
In stage 8: transferred 3336538 clusters with 10 clusters populated
In stage 9: transferred 3310691 clusters with 10 clusters populated
In stage 10: transferred 3286571 clusters with 10 clusters populated
In stage 11: transferred 3268170 clusters with 10 clusters populated
In stage 12: transferred 3254776 clusters with 10 clusters populated
In stage 13: transferred 3243022 clusters with 10 clusters populated
In stage 14: transferred 3234130 clusters wi

In [18]:
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topic :', doc_count)
print('*'*20)

fractions = doc_count * 100. / doc_count.sum()
np.set_printoptions(precision=2)
print('% of documents per topic:', fractions)
print('*'*20)

# Topics sorted by the number of documents they are allocated to
top_index = doc_count.argsort()[::-1]
print('Most important topics (by number of docs inside):', top_index)
print('*'*20)

Number of documents per topic : [ 459161  841578  961572  770719  359973  555043  810789 1011084 1233182
  418616]
********************
% of documents per topic: [ 6.19 11.34 12.96 10.38  4.85  7.48 10.92 13.62 16.62  5.64]
********************
Most important topics (by number of docs inside): [8 7 2 1 6 3 5 0 9 4]
********************


### Most frequent words in each topic

In [19]:
def top_words(cluster_word_distribution, top_index, num_words):
    # Print the `num_words` most frequent words of each topic, in the order given by `top_index`.
    for index in top_index:
        print('Topic {} '.format(index))
        print(sorted(cluster_word_distribution[index].items(), key=operator.itemgetter(1), reverse=True)[:num_words])
        print('*'*20)

In [20]:
top_words(mgp.cluster_word_distribution, top_index, 15)

Topic 8 
[('walk', 284680), ('station', 207149), ('restaurant', 196004), ('minute', 117339), ('locate', 92960), ('distance', 92588), ('shop', 92301), ('subway', 90008), ('area', 76462), ('train', 74239), ('bus', 68447), ('neighborhood', 62529), ('lot', 61648), ('bar', 60663), ('store', 59889)]
********************
Topic 7 
[('recommend', 317625), ('would', 317117), ('time', 69387), ('come', 64378), ('thank', 58587), ('visit', 53455), ('love', 48635), ('experience', 32684), ('enjoy', 32429), ('book', 31287), ('look', 28735), ('return', 27534), ('friend', 23978), ('hope', 20978), ('family', 20367)]
********************
Topic 2 
[('feel', 110465), ('thank', 101135), ('home', 98320), ('make', 93216), ('time', 45843), ('experience', 44538), ('family', 41720), ('enjoy', 36164), ('need', 34174), ('house', 28865), ('love', 27721), ('hospitality', 22957), ('room', 21968), ('space', 21406), ('meet', 19819)]
********************
Topic 1 
[('check', 175457), ('communication', 67069), ('question', 

### Assign topic name

In [36]:
topic_dict = {}
topic_names = ['Location',
               'Recommendation',
               'Experience',
               'Host',
               'Room',
               'Value',
               'View',
               'Advice',
               'Complaint',
               'Amenities']
for i, topic_num in enumerate(top_index):
    topic_dict[topic_num]=topic_names[i] 

In [37]:
topic_dict

{8: 'Location',
 7: 'Recommendation',
 2: 'Experience',
 1: 'Host',
 6: 'Room',
 3: 'Value',
 5: 'View',
 0: 'Advice',
 9: 'Complaint',
 4: 'Amenities'}

In [21]:
with open('model_mgp_all', 'wb') as f:
    pickle.dump(mgp, f)

In [22]:
import pickle
with open('model_mgp_all', 'rb') as f:
    mgp = pickle.load(f)

### Assign topic name  and probability of topic to each sentence

In [20]:
# Sequential reference version; the multiprocessing version in the next cell redefines this function.
def create_topics_dataframe(data_text=data, mgp=mgp, threshold=0.4, topic_dict=topic_dict):
    result = pd.DataFrame(columns=['text', 'topic', 'topic_prob'])
    with tqdm(total=len(data_text)) as pbar:
        for i, text in enumerate(data_text):
            result.at[i, 'text'] = text
            prob = mgp.choose_best_label(text)  # (best topic index, its probability)
            if prob[1] >= threshold:
                result.at[i, 'topic'] = topic_dict[prob[0]]
                result.at[i, 'topic_prob'] = prob[1]
            else:
                if len(text) != 0:
                    result.at[i, 'topic'] = 'Other'
                    result.at[i, 'topic_prob'] = prob[1]
                else:
                    result.at[i, 'topic'] = []
                    result.at[i, 'topic_prob'] = None
            pbar.update(1)
        return result

In [38]:
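def assign_topic(itext, threshold, topic_dict):
    # `mgp` is read from the module-level global, so worker processes that are
    # forked from the notebook inherit the fitted model instead of receiving it with every task.
    i, text = itext
    prob = mgp.choose_best_label(text)  # (best topic index, its probability)

    if prob[1] >= threshold:
        topic, topic_prob = topic_dict[prob[0]], prob[1]
    elif len(text) != 0:
        # Non-empty sentence without a sufficiently confident topic.
        topic, topic_prob = 'Other', prob[1]
    else:
        # Empty sentence: all of its words were removed during preprocessing.
        topic, topic_prob = [], None
    return [i, text, topic, topic_prob]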

def create_topics_dataframe(data_text=data, mgp=mgp, threshold=0.4, topic_dict=topic_dict):
    assign_topic_for_text = partial(assign_topic, threshold=threshold, topic_dict=topic_dict)
    
    with Pool(multiprocessing.cpu_count() - 1) as pool:
        processed_data = list(tqdm(pool.imap(assign_topic_for_text, enumerate(data_text)), total=len(data_text)))
        result_data = sorted(processed_data, key=lambda row: row[0])
        result = pd.DataFrame(result_data, columns=['i', 'text', 'topic', 'topic_prob'])
        result.drop(['i'], axis=1, inplace=True)
        return result

In [39]:
data_lemmatized_list = [sentence for review in data.data_lemmatized_no_freq.tolist() for sentence in review]

In [40]:
gsdmm_output = create_topics_dataframe(data_lemmatized_list)

100%|██████████| 14491096/14491096 [42:56<00:00, 5623.77it/s] 


In [23]:
gsdmm_output.sort_values(by='topic_prob', ascending=False)

Unnamed: 0,text,topic,topic_prob
170,"[drawback, thing, owner, could, control, noise...",Problems/Issues/Complain,1
279,"[reviewer, note, loading, dock, shop, glaze, w...",Problems/Issues/Complain,1
29,"[walk, finsbury, park, tube, station, direct, ...",Transport,1
441,"[kitchen, dish, washer, wash, machine]",Apartment/Amenities,1
497,"[restaurant, pub, walk, distance, grocery, sto...",Location,1
...,...,...,...
923,[],[],
943,[],[],
944,[],[],
967,[],[],


In [35]:
print(gsdmm_output[gsdmm_output.topic=='Complaint'].sort_values(by='topic_prob', ascending=False)['text'].tolist()[:100])

[['reviewer', 'note', 'loading', 'dock', 'shop', 'glaze', 'window', 'curtain', 'bedroom', 'minimise', 'noise', 'light', 'problem', 'morning', 'wish', 'sleep'], ['drawback', 'thing', 'owner', 'could', 'control', 'noise', 'truck', 'unload', 'morning', 'window', 'close', 'result', 'lack', 'air', 'circulation', 'night'], ['room', 'bit', 'night', 'time', 'noise', 'fan', 'block', 'street', 'noise'], ['morning', 'wake', 'lot', 'delivery', 'truck', 'noise'], ['sleeper', 'bring', 'ear', 'plug'], ['bit', 'noise', 'day', 'construction', 'day', 'issue'], ['ground', 'floor', 'feel', 'people', 'street', 'could', 'see', 'end', 'issue', 'make', 'close', 'curtain', 'need'], ['note', 'lobby', 'indicate', 'neighbor', 'take', 'noise', 'family', 'group'], ['bedroom', 'face', 'street', 'night', 'pub', 'patron', 'way'], ['room', 'front', 'road', 'noise', 'bed', 'make', 'night', 'sleep'], ['entrance', 'road', 'set', 'noise', 'street'], ['construction', 'door', 'bother'], ['floor', 'need', 'luggage', 'stair', 

In [41]:
with open('gsdmm_output_all', 'wb') as f:
    pickle.dump(gsdmm_output, f)

In [3]:
with open('gsdmm_output_all', 'rb') as f:
    gsdmm_output = pickle.load(f)

### Plot topics

In [52]:
import pandas as pd
import pyLDAvis
import math

def prepare_data(mgp):
    vocabulary = list(vocab)
    doc_topic_dists = [mgp.score(doc) for doc in docs]
    for doc in doc_topic_dists:
        for f in doc:
            assert not isinstance(f, complex)

    doc_lengths = [len(doc) for doc in docs]
    term_counts_map = {}
    for doc in docs:
        for term in doc:
            term_counts_map[term] = term_counts_map.get(term, 0) + 1
    term_counts = [term_counts_map[term] for term in vocabulary]
    doc_topic_dists2 = [[v if not math.isnan(v) else 1/K for v in d] for d in doc_topic_dists]
    doc_topic_dists2 = [d if sum(d) > 0 else [1/K]*K for d in doc_topic_dists2]
    for doc in doc_topic_dists2:
        for f in doc:
            assert not isinstance(f, complex)
    
    assert (pd.DataFrame(doc_topic_dists2).sum(axis=1) < 0.999).sum() == 0
    matrix = []
    for cluster in mgp.cluster_word_distribution:
        total = sum(cluster.values())
        assert not math.isnan(total)
        if total == 0:
            # Empty cluster: fall back to a uniform word distribution so pyLDAvis receives a valid row.
            row = [(1 / len(vocabulary))] * len(vocabulary)
        else:
            row = [cluster.get(term, 0) / total for term in vocabulary]
        for f in row:
            assert not isinstance(f, complex)
        matrix.append(row)
    return matrix, doc_topic_dists2, doc_lengths, vocabulary, term_counts

def prepare_visualization_data(mgp):
    vis_data = pyLDAvis.prepare(*prepare_data(mgp), sort_topics=False)
    return vis_data

vis_data = prepare_visualization_data(mgp)

%matplotlib inline
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_data)

  kernel = (topic_given_term * np.log((topic_given_term.T / topic_proportion).T))
  log_lift = np.log(topic_term_dists / term_proportion)
  log_ttd = np.log(topic_term_dists)
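
The topic-term matrix passed to pyLDAvis needs one probability distribution over the vocabulary per topic. The sketch below isolates that step from `prepare_data` on toy counts, including the uniform fallback for a cluster that ended up empty. The helper name `topic_term_matrix` and the data are hypothetical.

```python
def topic_term_matrix(cluster_word_counts, vocabulary):
    """Turn per-cluster word counts into rows that each sum to 1."""
    matrix = []
    for counts in cluster_word_counts:
        total = sum(counts.values())
        if total == 0:
            # An empty cluster has no counts to normalise, so use a uniform row.
            row = [1 / len(vocabulary)] * len(vocabulary)
        else:
            row = [counts.get(term, 0) / total for term in vocabulary]
        matrix.append(row)
    return matrix

vocabulary = ["walk", "station", "kitchen", "coffee"]
clusters = [{"walk": 3, "station": 1}, {}, {"kitchen": 1, "coffee": 1}]
matrix = topic_term_matrix(clusters, vocabulary)
for row in matrix:
    print(row)
```

Without the fallback, an empty cluster would produce a division by zero and a row that is not a valid distribution.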


### Merge sentence topic with data frame 

In [42]:
data['sentence_count'] = data.tokens.apply(lambda x: len(x))

In [43]:
data['first_sentence_index'] = data['sentence_count'].shift().cumsum().fillna(0).astype(int)
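
The shifted cumulative sum gives, for each review, the position of its first sentence in the flat sentence list that was passed to the topic model. A small stand-alone sketch of the same bookkeeping, using `itertools.accumulate` instead of pandas and placeholder labels, may make this easier to follow:

```python
from itertools import accumulate

# Number of sentences in each review, in data frame order.
sentence_counts = [10, 3, 7, 4, 2]

# Start offset of each review: 0 for the first, then the running total of the previous counts.
first_index = [0] + list(accumulate(sentence_counts))[:-1]
print(first_index)  # [0, 10, 13, 20, 24]

# A flat list with one (placeholder) topic label per sentence.
flat_topics = ["topic_%d" % i for i in range(sum(sentence_counts))]

# Slice the flat list back into one list of labels per review.
per_review = [flat_topics[start:start + count]
              for start, count in zip(first_index, sentence_counts)]
print([len(labels) for labels in per_review])  # [10, 3, 7, 4, 2]
```

This only works if the flat list contains exactly one entry per sentence, in the same order as the reviews, which is why empty sentences are kept rather than dropped.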

In [44]:
sentence_topics = gsdmm_output.topic.tolist()

In [45]:
data['sentence_topic'] = data[['first_sentence_index', 'sentence_count']]\
    .progress_apply(lambda row: sentence_topics[row['first_sentence_index'] : (row['first_sentence_index'] + row['sentence_count'])], axis = 1)


In [46]:
sentence_topics_prob = gsdmm_output.topic_prob.tolist()

In [47]:
data['sentence_topic_prob'] = data[['first_sentence_index', 'sentence_count']]\
    .progress_apply(lambda row: sentence_topics_prob[row['first_sentence_index'] : (row['first_sentence_index'] + row['sentence_count'])], axis = 1)


In [48]:
data.head()

Unnamed: 0,id,date,comments,host_id,neighbourhood_cleansed,city,latitude,longitude,number_of_reviews,first_review,...,sentiment_sentences,type,tokens,data_lemmatized,data_lemmatized_freq,data_lemmatized_no_freq,sentence_count,first_sentence_index,sentence_topic,sentence_topic_prob
0,13913,2010-08-18,My girlfriend and I hadn't known Alina before ...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[0, 0, 1, 0, 1, 0, 0, 0, 0, 1]",Western,"[[my, girlfriend, and, had, not, known, alina,...","[[girlfriend, know, take, leap, faith, rent], ...","[[girlfriend, know, take, leap, faith, rent], ...","[[girlfriend, know, take, leap, faith, rent], ...",10,0,"[Experience, Other, Experience, Other, Locatio...","[0.9960194272226215, 0.1788041663197916, 0.511..."
1,13913,2011-07-11,Alina was a really good host. The flat is clea...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[0, 0, 0]",Western,"[[alina, was, really, good, host], [the, flat,...","[[host], [finsbury, park, station], [recommend]]","[[host], [finsbury, park, station], [recommend]]","[[], [finsbury, park, station], [recommend]]",3,10,"[[], Location, Recommendation]","[nan, 0.999460840070547, 0.8577023894096741]"
2,13913,2011-09-13,Alina is an amazing host. She made me feel rig...,54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[1, 0, 1, 1, 0, 0, 1]",Western,"[[alina, is, an, amazing, host], [she, made, m...","[[host], [make, feel, home], [hang, friend, st...","[[host], [make, feel, home], [hang, friend, st...","[[], [make, feel, home], [hang, friend, strang...",7,13,"[[], Experience, Experience, Amenities, Amenit...","[nan, 0.9942197962394913, 0.8663873241406135, ..."
3,13913,2011-10-03,"Alina's place is so nice, the room is big and ...",54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[1, 1, 0, 1]",Western,"[[alina, place, is, so, nice, the, room, is, b...","[[room, bed], [host, make, need, instance, put...","[[room, bed], [host, make, need, instance, put...","[[room, bed], [make, need, instance, put, towe...",4,20,"[Room, Amenities, Advice, Experience]","[0.928931220270995, 0.6324100040587486, 0.9323..."
4,13913,2011-10-09,"Nice location in Islington area, good for shor...",54730,Islington,London,51.56802,-0.11121,21,2010-08-18,...,"[1, 1]",Western,"[[nice, location, in, islington, area, good, f...","[[location, area, business, trip], [host]]","[[location, area, business, trip], [host]]","[[area, business, trip], []]",2,24,"[Value, []]","[0.6871687901497933, nan]"


In [49]:
data.comments_to_sentences[5]

['I am very happy to have been Alina s guest.',
 'we have had great time in London and enjoyed our stay.',
 'Alina is a great host we felt us so welcomed by her.',
 'Alina s house location is very convenient it is only min walk to Finsbury Park tube station and also a direct Picadilly line to Heathrow Airport in case yu have an early departure you can use the opportunity to sleep a bit in the train.',
 'The flat itself is very nice and clean and comfortable especially the double-bed with new mattress I slept like a newborn And also the red sofa on the small roof terrace is great I enjoyed the last night London sky.',
 'To all who is going to visit London I highly reccomend Alina and her beautiful house to stay in.',
 'Alina thank you so much and I hope to see you one day again']

In [50]:
data.sentence_topic[5]

['Other',
 'Recommendation',
 'Experience',
 'Location',
 'Room',
 'Recommendation',
 'Recommendation']

In [51]:
with open('data_with_topics_all', 'wb') as f:
    pickle.dump(data, f)