
# **Topic Modeling**
## **Introduction**






Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

This notebook walks through **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques and one that was designed specifically for text data.

To use a topic modeling technique, you need to provide:

> (1) a document-term matrix and

> (2) the number of topics you would like the algorithm to pick up.






Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.
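To make the first input concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The two toy documents and all variable names are illustrative assumptions, not data from this notebook; the cells further down build the real matrix with `CountVectorizer`.

```python
from collections import Counter

# Two toy documents (hypothetical, not taken from the review dataset)
docs = ["great pool great view", "room service slow"]

# Count how often each word occurs in each document
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all words seen in the corpus
vocab = sorted(set(word for c in counts for word in c))

# One row per document, one column per vocabulary term
dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column is a term, so a cell holds the number of times that term occurs in that document.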


# NOTE: LDA takes a VERY LONG time to run

**Importing the data**

In [1]:
import pandas as pd
import numpy as np
import pyprojroot.here as here

# Data Cleaning

In [2]:
# Load the dataset
df = pd.read_csv(here('data/processed/cleaned_mbs_total.csv'))
df.head(3)

Unnamed: 0.1,Unnamed: 0,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating,valid_rating,label,cleaned_review,combined_review,date,covid,year,stem_review,lem_review
0,0,Erica G,"Sick in Singapore, and MBS staff were amazing!","I was in Singapore on business and, unfortunat...",Trip type: Travelled on business,"Arlington Heights, Illinois",105.0,62.0,5.0,True,Positive,sick singapore mbs staff amazing singapore bus...,"Sick in Singapore, and MBS staff were amazing!...",2023-08-01,PostCovid,2023,sick singapor mb staff amaz singapor busi unfo...,sick singapore mbs staff amazing singapore bus...
1,1,HJay,Lovely place to go whatever you plan to do!,Whether it’s to soak up the Marina Bay citysca...,,"Perth, Australia",14.0,11.0,,False,,lovely place go whatever plan whether soak mar...,Lovely place to go whatever you plan to do! Wh...,2023-04-01,PostCovid,2023,love place go whatev plan whether soak marina ...,lovely place go whatever plan whether soak mar...
2,2,TaM,Thank you for the unforgettable memories,I stayed at Marina Bay Sands to propose to my ...,Trip type: Travelled as a couple,,1.0,,5.0,True,Positive,thank unforgettable memories stayed marina bay...,Thank you for the unforgettable memories I sta...,2023-09-01,PostCovid,2023,thank unforgett memori stay marina bay sand pr...,thank unforgettable memory stay marina bay san...


In [3]:
df[df.traveller_username == 'Jayne Jeong'].review_text.values

array([' As a remarkable landmark of Singapore, I recommend this hotel. As you know, it is known for its great infinity pool and definitely it is wonderful. And also the staffs are kind and professional. There are several Korean staffs and they are ready to help you. That is why I tried this hotel twice - first was 2016 and this time. A room condition was okay and met my expectation.   * 수영장 하나만으로도 값어치 한다고 생각합니다. 인피니티 풀에서 보는 뷰는 낮/밤 모두 환상적이니 꼭 챙겨보세요 :) '],
      dtype=object)

In [4]:
df.isnull().sum()

Unnamed: 0                                  0
traveller_username                          0
review_title                                0
review_text                                 0
travel_type                              4093
traveller_country_origin                 1615
traveller_total_contributions               5
traveller_total_helpful_contributions     582
rating                                      1
valid_rating                                0
label                                       1
cleaned_review                              0
combined_review                             0
date                                        0
covid                                       0
year                                        0
stem_review                                 0
lem_review                                  0
dtype: int64

### Create document-term matrix


In [5]:
df.shape

(10523, 18)

In [6]:
#import nltk
#from nltk.corpus import words
#from nltk.corpus import stopwords

# Tokenization and cleaning
#nltk.download('stopwords')

# Download the NLTK words corpus if you haven't already
#nltk.download('words')

# Get the list of valid English words
#english_words = set(words.words())

# set stopwords
#stop_words = set(stopwords.words('english'))

In [7]:
#def remove_stopwords(text):
#    words = text.split()  # Split the text into words
#    filtered_words = [word for word in words if word.lower() not in stop_words]
#    return ' '.join(filtered_words)

#def remove_non_english_words(text, valid_words):
#    words = text.split()
#    filtered_words = [word for word in words if word.lower() in valid_words]
#    return ' '.join(filtered_words)

In [8]:
#df['cleaned_review'] = df['cleaned_review'].apply(remove_stopwords)
#df['cleaned_review'] = df['cleaned_review'].apply(remove_non_english_words, valid_words=english_words)

# Topic Modeling with Good reviews (rating >= 4)

In [9]:
df_good_reviews = df.loc[df.rating >= 4]
df_good_reviews.shape

(8787, 18)

In [10]:
# Create document-term matrix
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform(df_good_reviews.cleaned_review)

In [11]:
data_cv.shape

(8787, 15713)

In [12]:
df_good_reviews.head()

Unnamed: 0.1,Unnamed: 0,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating,valid_rating,label,cleaned_review,combined_review,date,covid,year,stem_review,lem_review
0,0,Erica G,"Sick in Singapore, and MBS staff were amazing!","I was in Singapore on business and, unfortunat...",Trip type: Travelled on business,"Arlington Heights, Illinois",105.0,62.0,5.0,True,Positive,sick singapore mbs staff amazing singapore bus...,"Sick in Singapore, and MBS staff were amazing!...",2023-08-01,PostCovid,2023,sick singapor mb staff amaz singapor busi unfo...,sick singapore mbs staff amazing singapore bus...
2,2,TaM,Thank you for the unforgettable memories,I stayed at Marina Bay Sands to propose to my ...,Trip type: Travelled as a couple,,1.0,,5.0,True,Positive,thank unforgettable memories stayed marina bay...,Thank you for the unforgettable memories I sta...,2023-09-01,PostCovid,2023,thank unforgett memori stay marina bay sand pr...,thank unforgettable memory stay marina bay san...
3,3,Praxmeyer,Amazing hotel but not sure I’d do it again,We stayed one night in the hotel. The good par...,Trip type: Travelled as a couple,"Napier, New Zealand",444.0,232.0,4.0,True,Positive,amazing hotel sure stayed one night hotel good...,Amazing hotel but not sure I’d do it again We ...,2023-09-01,PostCovid,2023,amaz hotel sure stay one night hotel good part...,amazing hotel sure stay one night hotel good p...
4,4,TravelWriter74,Stunning hotel and overall an amazing experience,We have just returned from a super couple of d...,Trip type: Travelled with family,"London, United Kingdom",166.0,114.0,5.0,True,Positive,stunning hotel overall amazing experience retu...,Stunning hotel and overall an amazing experien...,2023-08-01,PostCovid,2023,stun hotel overal amaz experi return super cou...,stunning hotel overall amazing experience retu...
5,5,Ingo S,Perfect stay!,One of the best hotels we stayed! Great enviro...,Trip type: Travelled as a couple,"Ilmenau, Germany",3.0,,5.0,True,Positive,perfect stay one best hotels stayed great envi...,Perfect stay! One of the best hotels we stayed...,2023-09-01,PostCovid,2023,perfect stay one best hotel stay great environ...,perfect stay one good hotel stay great environ...


## Topic Modeling - Attempt #1 (Complete Review Text)

In [13]:
data = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data.index = df_good_reviews.index
data

Unnamed: 0,aaa,aaaah,aaamazing,aahhhh,aamzing,aarguable,aback,abandon,abandoned,abbreviate,...,zing,zip,zone,zones,zoo,zoom,zooming,zooms,zoos,zulfadli
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10518,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10519,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
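`toarray()` turns the sparse counts into a dense array, which can be sizeable for a matrix of this shape. The back-of-the-envelope estimate below assumes 8 bytes per cell (a 64-bit integer, which is `CountVectorizer`'s default dtype as far as I know; the notebook output does not confirm it).

```python
# Shape reported by data_cv.shape above
n_docs, n_terms = 8787, 15713

# Assumption: each cell is stored as a 64-bit integer (8 bytes)
bytes_per_cell = 8

n_cells = n_docs * n_terms
dense_bytes = n_cells * bytes_per_cell
dense_gib = dense_bytes / 2**30

print(f"{n_cells:,} cells, about {dense_gib:.2f} GiB when dense")
```

If memory is tight, one option is to keep the dense DataFrame only for inspection and hand gensim the transposed sparse matrix directly (for example `data_cv.T`), since `matutils.Sparse2Corpus` accepts a scipy sparse matrix.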


In [14]:
# Import the necessary modules for LDA with gensim
from gensim import matutils, models
import scipy.sparse

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [15]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,0,2,3,4,5,6,7,9,10,11,...,10507,10508,10509,10510,10512,10518,10519,10520,10521,10522
aaa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaaah,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aaamazing,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aahhhh,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aamzing,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [17]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
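The line above inverts a mapping. A minimal sketch with a made-up three-term vocabulary (the terms and indices are assumptions for illustration) shows the direction of the change:

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_:
# each term maps to its column index in the document-term matrix
vocabulary = {"hotel": 0, "pool": 1, "room": 2}

# Invert it so each column index maps back to its term,
# which is the direction the id2word argument expects
id2word = {index: term for term, index in vocabulary.items()}

print(id2word)
```

The dict comprehension is equivalent to the `dict((v, k) for k, v in ...)` form used in the notebook; either works.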

In [18]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.024*"hotel" + 0.022*"pool" + 0.013*"room" + 0.011*"bay" + 0.009*"view" + 0.008*"singapore" + 0.008*"get" + 0.008*"one" + 0.007*"floor" + 0.006*"gardens"'),
 (1,
  '0.039*"hotel" + 0.032*"pool" + 0.022*"view" + 0.019*"amazing" + 0.018*"great" + 0.016*"singapore" + 0.016*"stay" + 0.014*"room" + 0.013*"bay" + 0.012*"infinity"'),
 (2,
  '0.032*"room" + 0.017*"check" + 0.013*"us" + 0.011*"staff" + 0.010*"club" + 0.010*"pool" + 0.009*"service" + 0.009*"stay" + 0.009*"hotel" + 0.007*"time"')]
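Each entry in these strings pairs a word with its estimated probability within the topic. For comparing topics programmatically, a small parser can help. The sketch below is an assumption-level helper (`parse_topic` is a hypothetical name, not part of gensim) applied to the topic 0 string printed above.

```python
import re

# Topic 0 string copied from the 3-topic output above
topic = ('0.024*"hotel" + 0.022*"pool" + 0.013*"room" + 0.011*"bay" + '
         '0.009*"view" + 0.008*"singapore" + 0.008*"get" + 0.008*"one" + '
         '0.007*"floor" + 0.006*"gardens"')

def parse_topic(topic_string):
    """Split a print_topics() string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic)

# Share of the topic's probability mass covered by the listed words
top_mass = sum(weight for _, weight in terms)

print(terms[:3])
print(round(top_mass, 3))
```

The ten listed words cover only a small share of the topic's total probability, which is one reason the top words alone can make topics look more alike than they are. gensim's `show_topic` method returns word and probability pairs directly, which avoids parsing strings when the model object is still in memory.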

In [19]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.031*"room" + 0.015*"club" + 0.011*"stay" + 0.010*"us" + 0.010*"staff" + 0.010*"hotel" + 0.010*"pool" + 0.009*"floor" + 0.009*"amazing" + 0.009*"view"'),
 (1,
  '0.025*"room" + 0.023*"pool" + 0.020*"hotel" + 0.015*"check" + 0.009*"view" + 0.008*"get" + 0.008*"good" + 0.007*"people" + 0.007*"one" + 0.007*"time"'),
 (2,
  '0.022*"mbs" + 0.016*"service" + 0.009*"staff" + 0.009*"stay" + 0.008*"experience" + 0.008*"like" + 0.006*"thank" + 0.006*"team" + 0.006*"hotel" + 0.005*"time"'),
 (3,
  '0.040*"hotel" + 0.031*"pool" + 0.020*"view" + 0.019*"singapore" + 0.017*"bay" + 0.015*"amazing" + 0.014*"great" + 0.013*"stay" + 0.012*"infinity" + 0.011*"marina"')]

In [20]:
# LDA for num_topics = 5
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10)
lda.print_topics()

[(0,
  '0.048*"pool" + 0.043*"view" + 0.037*"hotel" + 0.022*"great" + 0.021*"room" + 0.019*"amazing" + 0.018*"good" + 0.016*"nice" + 0.014*"infinity" + 0.013*"stay"'),
 (1,
  '0.030*"room" + 0.020*"pool" + 0.019*"amazing" + 0.018*"club" + 0.015*"hotel" + 0.015*"stay" + 0.014*"view" + 0.012*"staff" + 0.012*"night" + 0.012*"views"'),
 (2,
  '0.042*"hotel" + 0.021*"pool" + 0.020*"singapore" + 0.020*"bay" + 0.013*"marina" + 0.010*"great" + 0.010*"shopping" + 0.010*"one" + 0.010*"stay" + 0.009*"sands"'),
 (3,
  '0.017*"staff" + 0.017*"service" + 0.017*"stay" + 0.016*"mbs" + 0.015*"us" + 0.014*"hotel" + 0.012*"room" + 0.009*"marina" + 0.009*"experience" + 0.008*"bay"'),
 (4,
  '0.027*"room" + 0.017*"pool" + 0.014*"check" + 0.012*"hotel" + 0.008*"get" + 0.007*"people" + 0.007*"one" + 0.006*"would" + 0.006*"floor" + 0.006*"time"')]

In [21]:
# LDA for num_topics = 6
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=10)
lda.print_topics()

[(0,
  '0.026*"us" + 0.024*"mbs" + 0.022*"staff" + 0.021*"service" + 0.015*"stay" + 0.015*"birthday" + 0.012*"thank" + 0.009*"room" + 0.009*"staycation" + 0.009*"team"'),
 (1,
  '0.037*"room" + 0.019*"check" + 0.017*"club" + 0.012*"us" + 0.009*"floor" + 0.007*"pool" + 0.007*"pm" + 0.007*"breakfast" + 0.007*"suite" + 0.007*"th"'),
 (2,
  '0.038*"pool" + 0.033*"hotel" + 0.028*"view" + 0.025*"amazing" + 0.020*"room" + 0.019*"stay" + 0.018*"singapore" + 0.016*"great" + 0.016*"bay" + 0.014*"infinity"'),
 (3,
  '0.029*"hotel" + 0.026*"pool" + 0.018*"room" + 0.011*"good" + 0.010*"great" + 0.010*"view" + 0.008*"rooms" + 0.008*"people" + 0.008*"service" + 0.008*"stay"'),
 (4,
  '0.035*"hotel" + 0.022*"bay" + 0.020*"singapore" + 0.016*"shopping" + 0.013*"marina" + 0.012*"mall" + 0.012*"pool" + 0.011*"restaurants" + 0.010*"casino" + 0.009*"sands"'),
 (5,
  '0.015*"la" + 0.009*"vie" + 0.008*"ce" + 0.005*"carte" + 0.005*"wanna" + 0.004*"games" + 0.004*"hallways" + 0.004*"vi" + 0.004*"grt" + 0.003*"

## Topic Modeling - Attempt #2 (Nouns Only)

In [22]:
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import string
from nltk import word_tokenize, pos_tag

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [23]:
# Let's create a function to pull out nouns from a string of text
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
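The `pos[:2] == 'NN'` test works because the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all share that prefix. The sketch below applies the same filter to hand-written `(word, tag)` pairs; the tags are illustrative assumptions, not real tagger output, so the example runs without downloading any NLTK data.

```python
# Hand-written (word, tag) pairs in the format pos_tag returns.
# The tags are illustrative assumptions, not real tagger output.
tagged = [
    ("staff", "NN"),       # singular noun
    ("rooms", "NNS"),      # plural noun
    ("singapore", "NNP"),  # proper noun
    ("amazing", "JJ"),     # adjective
    ("stayed", "VBD"),     # past-tense verb
]

# Same prefix test as in nouns(): every noun tag starts with "NN"
only_nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(only_nouns)
```

Because `cleaned_review` is already lowercased and stripped of stop words, the tagger sees less context than it would in raw sentences, so some tags on the real data may be less reliable than this toy case suggests.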

In [24]:
data_nouns = pd.DataFrame(df_good_reviews.cleaned_review.apply(nouns))
data_nouns

Unnamed: 0,cleaned_review
0,singapore staff singapore business room days r...
2,thank memories marina bay sands girlfriend sta...
3,hotel night hotel parts staff room view city c...
4,hotel experience couple days week holidays nig...
5,hotels environment service rooms kind addition...
...,...
10518,place year sg week year rooms city view price ...
10519,hotel view city pool mums hotel months mums ho...
10520,world service immaculate room bathroom balcony...
10521,days weeks hotels motels recover relax hotel g...


In [25]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aahhhh,abandon,abd,abdul,abhor,ability,abit,abraj,absence,absent,...,zhu,zi,zig,zip,zone,zones,zoo,zoom,zoos,zulfadli
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10518,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10519,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [27]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.081*"hotel" + 0.042*"pool" + 0.019*"place" + 0.017*"bay" + 0.017*"rooms" + 0.016*"infinity" + 0.016*"marina" + 0.015*"view" + 0.015*"singapore" + 0.015*"views"'),
 (1,
  '0.069*"room" + 0.049*"pool" + 0.043*"view" + 0.041*"hotel" + 0.021*"floor" + 0.019*"night" + 0.017*"city" + 0.014*"infinity" + 0.013*"breakfast" + 0.012*"views"'),
 (2,
  '0.038*"room" + 0.028*"staff" + 0.023*"hotel" + 0.023*"service" + 0.020*"check" + 0.019*"pool" + 0.016*"time" + 0.014*"mbs" + 0.012*"stay" + 0.011*"experience"')]

In [28]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.054*"room" + 0.046*"hotel" + 0.039*"pool" + 0.019*"check" + 0.015*"staff" + 0.015*"time" + 0.014*"service" + 0.014*"view" + 0.013*"people" + 0.012*"rooms"'),
 (1,
  '0.066*"room" + 0.053*"view" + 0.046*"pool" + 0.029*"night" + 0.028*"hotel" + 0.024*"floor" + 0.024*"city" + 0.019*"club" + 0.016*"infinity" + 0.015*"staff"'),
 (2,
  '0.034*"hotel" + 0.030*"service" + 0.030*"place" + 0.024*"staff" + 0.023*"stay" + 0.018*"mbs" + 0.015*"experience" + 0.012*"everything" + 0.011*"time" + 0.010*"singapore"'),
 (3,
  '0.088*"hotel" + 0.049*"pool" + 0.030*"bay" + 0.030*"marina" + 0.023*"sands" + 0.023*"view" + 0.022*"infinity" + 0.017*"restaurants" + 0.017*"rooms" + 0.017*"singapore"')]

In [29]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.074*"hotel" + 0.063*"pool" + 0.056*"view" + 0.035*"room" + 0.027*"infinity" + 0.026*"bay" + 0.023*"marina" + 0.023*"night" + 0.023*"experience" + 0.021*"city"'),
 (1,
  '0.043*"service" + 0.029*"hotel" + 0.027*"staff" + 0.022*"mbs" + 0.021*"room" + 0.017*"stay" + 0.014*"team" + 0.014*"check" + 0.009*"thanks" + 0.009*"thank"'),
 (2,
  '0.074*"hotel" + 0.042*"pool" + 0.019*"room" + 0.017*"views" + 0.017*"rooms" + 0.014*"food" + 0.012*"people" + 0.011*"restaurants" + 0.010*"infinity" + 0.010*"floor"'),
 (3,
  '0.090*"room" + 0.027*"club" + 0.024*"view" + 0.022*"floor" + 0.019*"pool" + 0.019*"staff" + 0.019*"check" + 0.017*"breakfast" + 0.013*"service" + 0.013*"night"'),
 (4,
  '0.040*"pool" + 0.031*"room" + 0.025*"time" + 0.020*"view" + 0.017*"check" + 0.014*"hotel" + 0.013*"experience" + 0.013*"people" + 0.011*"bay" + 0.010*"card"')]

In [30]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.050*"staff" + 0.033*"stay" + 0.032*"service" + 0.026*"family" + 0.018*"hotel" + 0.018*"pool" + 0.018*"place" + 0.017*"mbs" + 0.017*"time" + 0.016*"birthday"'),
 (1,
  '0.037*"service" + 0.037*"room" + 0.018*"staff" + 0.017*"hotel" + 0.015*"check" + 0.015*"mbs" + 0.014*"food" + 0.013*"experience" + 0.013*"marina" + 0.012*"sands"'),
 (2,
  '0.061*"list" + 0.042*"bucket" + 0.018*"staff" + 0.011*"check" + 0.011*"experience" + 0.007*"phone" + 0.007*"requests" + 0.006*"phones" + 0.006*"mbs" + 0.005*"tick"'),
 (3,
  '0.080*"hotel" + 0.060*"pool" + 0.035*"view" + 0.023*"infinity" + 0.021*"views" + 0.021*"experience" + 0.019*"bay" + 0.019*"rooms" + 0.018*"city" + 0.018*"place"'),
 (4,
  '0.083*"hotel" + 0.022*"bay" + 0.019*"marina" + 0.019*"pool" + 0.016*"sands" + 0.015*"room" + 0.012*"area" + 0.011*"rooms" + 0.010*"time" + 0.009*"restaurants"'),
 (5,
  '0.076*"room" + 0.045*"pool" + 0.033*"view" + 0.030*"hotel" + 0.021*"floor" + 0.018*"night" + 0.018*"check" + 0.014*"city" + 0.014*"s

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [31]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    # Use a distinct local name so the function name is not shadowed
    selected = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(selected)

In [32]:
data_nouns_adj = pd.DataFrame(df_good_reviews.cleaned_review.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,cleaned_review
0,sick singapore mbs staff singapore business st...
2,thank unforgettable memories marina bay sands ...
3,amazing hotel sure night hotel good parts staf...
4,hotel overall amazing experience super couple ...
5,perfect best hotels great environment great se...
...,...
10518,place new year sg last week new year rooms cit...
10519,magnificent hotel best view city pool mums hol...
10520,top world service immaculate room bathroom lar...
10521,days weeks different hotels motels recover rel...


In [33]:
# Create a new document-term matrix using only nouns and adjectives.
# max_df removes terms that appear too frequently, also known as "corpus-specific stop words".
# For example, max_df=.8 ignores terms that appear in more than 80% of the documents.
cvna = CountVectorizer(max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.cleaned_review)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aaa,aahhhh,aarguable,abandon,abbreviate,abd,abdul,abhor,ability,abit,...,zhu,zi,zig,zip,zone,zones,zoo,zoom,zoos,zulfadli
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10518,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10519,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10520,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10521,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
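The effect of `max_df` can be reproduced by hand on a toy corpus. The sketch below is a pure-Python imitation of the rule, not scikit-learn's implementation; the five documents are made up, and the boundary handling (a term at exactly 80% is kept) follows scikit-learn's documented "strictly higher than the threshold" wording.

```python
# Five toy documents (hypothetical, not from the review dataset)
docs = [
    "hotel pool view",
    "hotel pool room",
    "hotel pool staff",
    "hotel pool club",
    "hotel breakfast",
]
max_df = 0.8

# Document frequency: the number of documents containing each term
vocab = sorted(set(word for doc in docs for word in doc.split()))
doc_freq = {term: sum(term in doc.split() for doc in docs) for term in vocab}

# Keep a term unless it appears in strictly more than max_df of the documents
kept = [term for term in vocab if doc_freq[term] / len(docs) <= max_df]

print(doc_freq["hotel"], doc_freq["pool"])
print(kept)
```

Here "hotel" appears in all five documents and is dropped, while "pool" appears in four of five (exactly 80%) and survives. On the real reviews, the topic outputs in this notebook still list "hotel", "pool" and "room" at the top, which suggests few if any terms cross the 80% line, so a lower threshold may be worth trying.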


In [34]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [35]:
# Let's start with 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.050*"hotel" + 0.027*"pool" + 0.023*"bay" + 0.018*"view" + 0.017*"marina" + 0.016*"great" + 0.014*"place" + 0.014*"singapore" + 0.014*"infinity" + 0.013*"top"'),
 (1,
  '0.025*"room" + 0.022*"staff" + 0.019*"service" + 0.014*"mbs" + 0.012*"stay" + 0.011*"check" + 0.010*"hotel" + 0.010*"time" + 0.010*"club" + 0.008*"birthday"'),
 (2,
  '0.041*"room" + 0.037*"pool" + 0.034*"hotel" + 0.022*"view" + 0.015*"great" + 0.012*"floor" + 0.012*"night" + 0.010*"city" + 0.009*"infinity" + 0.009*"good"')]

In [36]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.043*"hotel" + 0.043*"pool" + 0.028*"room" + 0.028*"view" + 0.020*"great" + 0.014*"night" + 0.013*"infinity" + 0.012*"city" + 0.012*"floor" + 0.012*"views"'),
 (1,
  '0.052*"room" + 0.023*"club" + 0.021*"check" + 0.012*"staff" + 0.011*"time" + 0.008*"breakfast" + 0.008*"service" + 0.008*"tower" + 0.008*"pm" + 0.007*"tea"'),
 (2,
  '0.027*"staff" + 0.024*"service" + 0.023*"stay" + 0.020*"mbs" + 0.019*"room" + 0.015*"great" + 0.014*"birthday" + 0.013*"experience" + 0.011*"time" + 0.011*"hotel"'),
 (3,
  '0.049*"hotel" + 0.030*"bay" + 0.026*"marina" + 0.020*"sands" + 0.016*"pool" + 0.016*"singapore" + 0.012*"place" + 0.010*"best" + 0.010*"experience" + 0.010*"infinity"')]

In [37]:
# Let's try 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.055*"hotel" + 0.038*"pool" + 0.020*"great" + 0.019*"view" + 0.014*"infinity" + 0.014*"views" + 0.012*"good" + 0.012*"rooms" + 0.011*"top" + 0.011*"room"'),
 (1,
  '0.024*"service" + 0.018*"staff" + 0.017*"mbs" + 0.010*"stay" + 0.009*"family" + 0.008*"thank" + 0.008*"team" + 0.008*"good" + 0.006*"front" + 0.006*"staycation"'),
 (2,
  '0.053*"club" + 0.037*"room" + 0.024*"staff" + 0.018*"service" + 0.017*"breakfast" + 0.017*"suite" + 0.013*"tea" + 0.013*"lounge" + 0.012*"afternoon" + 0.012*"great"'),
 (3,
  '0.047*"room" + 0.041*"pool" + 0.023*"hotel" + 0.021*"view" + 0.017*"check" + 0.015*"time" + 0.015*"night" + 0.011*"great" + 0.011*"experience" + 0.010*"staff"'),
 (4,
  '0.041*"room" + 0.036*"bay" + 0.029*"marina" + 0.029*"hotel" + 0.023*"view" + 0.022*"sands" + 0.014*"pool" + 0.013*"floor" + 0.011*"stay" + 0.008*"great"')]

In [38]:
# Let's try 6 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=6, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.052*"bay" + 0.042*"marina" + 0.032*"sands" + 0.016*"singapore" + 0.015*"view" + 0.012*"visit" + 0.012*"place" + 0.011*"show" + 0.007*"park" + 0.007*"sand"'),
 (1,
  '0.040*"bar" + 0.019*"la" + 0.015*"drinks" + 0.013*"top" + 0.013*"ce" + 0.009*"drink" + 0.009*"music" + 0.007*"visit" + 0.006*"cocktails" + 0.006*"forest"'),
 (2,
  '0.034*"room" + 0.031*"pool" + 0.025*"hotel" + 0.012*"check" + 0.011*"people" + 0.011*"view" + 0.011*"floor" + 0.010*"time" + 0.009*"night" + 0.008*"area"'),
 (3,
  '0.049*"hotel" + 0.040*"pool" + 0.031*"view" + 0.030*"room" + 0.024*"great" + 0.016*"bay" + 0.016*"infinity" + 0.015*"stay" + 0.014*"experience" + 0.014*"night"'),
 (4,
  '0.057*"hotel" + 0.025*"pool" + 0.021*"casino" + 0.019*"restaurants" + 0.017*"mall" + 0.016*"shopping" + 0.015*"great" + 0.014*"place" + 0.013*"food" + 0.013*"bay"'),
 (5,
  '0.041*"room" + 0.036*"club" + 0.024*"staff" + 0.022*"service" + 0.018*"suite" + 0.017*"mbs" + 0.013*"stay" + 0.012*"birthday" + 0.012*"check" + 0.011

## Identify Topics in Best Models

# Topic Modeling with Bad reviews (rating <= 2)

In [39]:
df_bad_reviews = df.loc[df.rating <= 2]
df_bad_reviews.shape

(707, 18)
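The two filters used in this notebook (`rating >= 4` and `rating <= 2`) do not cover every row: 3-star reviews match neither condition, and a missing rating compares as false in both. The sketch below shows this with plain Python floats (pandas follows the same NaN comparison rule); the toy ratings are assumptions, while the totals at the end are the shapes printed in this notebook.

```python
import math

# Toy ratings (hypothetical); math.nan stands in for a missing rating
ratings = [5.0, 4.0, 3.0, 2.0, 1.0, math.nan]

good = [r for r in ratings if r >= 4]
bad = [r for r in ratings if r <= 2]

# Whatever matches neither condition is left out of both analyses
excluded = len(ratings) - len(good) - len(bad)
print(len(good), len(bad), excluded)

# Same bookkeeping with the row counts reported in this notebook
total, n_good, n_bad = 10523, 8787, 707
left_out = total - n_good - n_bad
print(left_out)
```

So roughly a tenth of the reviews sit outside both the "good" and "bad" topic models, which is worth keeping in mind when comparing the two sets of topics.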

In [40]:
# Create document-term matrix
cv = CountVectorizer()
data_cv = cv.fit_transform(df_bad_reviews.cleaned_review)

In [41]:
data_cv.shape

(707, 7041)

## Topic Modeling - Attempt #1 (Complete Review Text)

In [42]:
data = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data.index = df_bad_reviews.index
data

Unnamed: 0,ab,aback,abandoned,abide,abiding,ability,abiut,able,abrasive,abrupt,...,yrs,yuk,yuri,zack,zara,zero,zilch,zircon,zone,zoo
31,0,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10441,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10458,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10484,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10514,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,31,33,35,42,53,84,96,107,108,118,...,10349,10357,10387,10408,10424,10441,10458,10484,10514,10515
ab,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aback,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandoned,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abide,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abiding,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [45]:
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [46]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.023*"hotel" + 0.012*"pool" + 0.011*"room" + 0.010*"us" + 0.009*"service" + 0.007*"staff" + 0.006*"one" + 0.006*"check" + 0.005*"get" + 0.005*"stay"'),
 (1,
  '0.030*"hotel" + 0.018*"pool" + 0.011*"room" + 0.010*"service" + 0.009*"stay" + 0.009*"check" + 0.008*"like" + 0.007*"staff" + 0.006*"people" + 0.006*"singapore"'),
 (2,
  '0.026*"room" + 0.014*"hotel" + 0.010*"check" + 0.009*"pool" + 0.008*"one" + 0.008*"us" + 0.007*"service" + 0.007*"would" + 0.006*"staff" + 0.006*"get"')]
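The three topics above share most of their top words, which makes them hard to tell apart by eye. One rough way to put a number on that, not used in the original notebook, is the Jaccard overlap between the top-word sets of two topics. The sketch below hard-codes the top-10 words of topics 0 and 1 from the output above; `jaccard` is a hypothetical helper name.

```python
# Top-10 words of topics 0 and 1 from the 3-topic output above
topic_0 = {"hotel", "pool", "room", "us", "service",
           "staff", "one", "check", "get", "stay"}
topic_1 = {"hotel", "pool", "room", "service", "stay",
           "check", "like", "staff", "people", "singapore"}

def jaccard(a, b):
    """Share of words common to both sets, out of all words in either."""
    return len(a & b) / len(a | b)

shared = sorted(topic_0 & topic_1)
overlap = jaccard(topic_0, topic_1)

print(shared)
print(round(overlap, 2))
```

A value near 1 means two topics are described by almost the same words; a value near 0 means they are distinct. High overlap like this is a hint that the shared words behave as corpus-specific stop words and could be filtered out before refitting.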

In [47]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.025*"hotel" + 0.018*"room" + 0.011*"pool" + 0.011*"service" + 0.009*"stay" + 0.008*"check" + 0.007*"like" + 0.007*"staff" + 0.007*"one" + 0.006*"would"'),
 (1,
  '0.018*"pool" + 0.018*"hotel" + 0.015*"room" + 0.008*"stay" + 0.008*"check" + 0.007*"us" + 0.006*"service" + 0.006*"view" + 0.006*"would" + 0.005*"one"'),
 (2,
  '0.018*"hotel" + 0.013*"check" + 0.013*"pool" + 0.008*"room" + 0.007*"staff" + 0.007*"one" + 0.006*"service" + 0.006*"us" + 0.006*"get" + 0.005*"time"'),
 (3,
  '0.025*"hotel" + 0.021*"room" + 0.013*"pool" + 0.009*"staff" + 0.007*"one" + 0.007*"us" + 0.007*"get" + 0.007*"service" + 0.006*"floor" + 0.006*"check"')]

In [48]:
# LDA for num_topics = 5
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10)
lda.print_topics()

[(0,
  '0.024*"hotel" + 0.022*"pool" + 0.015*"room" + 0.008*"people" + 0.007*"stay" + 0.007*"service" + 0.006*"like" + 0.006*"get" + 0.005*"check" + 0.005*"experience"'),
 (1,
  '0.024*"hotel" + 0.021*"room" + 0.012*"service" + 0.012*"us" + 0.011*"check" + 0.010*"pool" + 0.009*"staff" + 0.007*"stay" + 0.006*"one" + 0.006*"night"'),
 (2,
  '0.028*"hotel" + 0.017*"room" + 0.012*"pool" + 0.010*"one" + 0.009*"check" + 0.008*"service" + 0.008*"stay" + 0.007*"staff" + 0.007*"like" + 0.006*"night"'),
 (3,
  '0.011*"hotel" + 0.009*"stay" + 0.007*"pool" + 0.007*"room" + 0.007*"like" + 0.005*"service" + 0.004*"marina" + 0.004*"food" + 0.004*"get" + 0.004*"even"'),
 (4,
  '0.012*"room" + 0.012*"hotel" + 0.011*"pool" + 0.010*"check" + 0.006*"service" + 0.006*"staff" + 0.006*"card" + 0.006*"rooms" + 0.006*"get" + 0.005*"view"')]

In [49]:
# LDA for num_topics = 6
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=10)
lda.print_topics()

[(0,
  '0.028*"hotel" + 0.021*"pool" + 0.018*"room" + 0.010*"service" + 0.009*"stay" + 0.009*"check" + 0.007*"one" + 0.007*"get" + 0.006*"like" + 0.006*"night"'),
 (1,
  '0.021*"hotel" + 0.019*"pool" + 0.007*"room" + 0.006*"people" + 0.005*"stay" + 0.005*"view" + 0.005*"get" + 0.005*"even" + 0.005*"like" + 0.005*"night"'),
 (2,
  '0.014*"hotel" + 0.011*"room" + 0.006*"stay" + 0.006*"service" + 0.006*"check" + 0.005*"like" + 0.005*"pool" + 0.004*"hotels" + 0.004*"singapore" + 0.004*"staff"'),
 (3,
  '0.024*"hotel" + 0.014*"room" + 0.011*"us" + 0.010*"service" + 0.008*"pool" + 0.007*"check" + 0.007*"like" + 0.007*"would" + 0.007*"staff" + 0.006*"one"'),
 (4,
  '0.020*"room" + 0.011*"check" + 0.011*"hotel" + 0.010*"staff" + 0.008*"us" + 0.007*"would" + 0.006*"stay" + 0.006*"time" + 0.006*"one" + 0.006*"get"'),
 (5,
  '0.021*"room" + 0.020*"hotel" + 0.011*"pool" + 0.011*"check" + 0.009*"service" + 0.009*"staff" + 0.007*"one" + 0.007*"stay" + 0.007*"experience" + 0.007*"time"')]

## Topic Modeling - Attempt #2 (Nouns Only)

In [50]:
data_nouns = pd.DataFrame(df_bad_reviews.cleaned_review.apply(nouns))
data_nouns

Unnamed: 0,cleaned_review
31,check room reservations line check counter hou...
33,twin room floor level level twin room reservat...
35,way money waste place buffet heavy room food e...
42,money service rooms price level night access p...
53,time time marina sands time home raving enemy ...
...,...
10441,money marina hotel singapore skypark infinity ...
10458,service hotel hotel singapore case reputation ...
10484,star hotel hotel business event room rate star...
10514,honeymoon stay hotel dream year hotel delusion...


In [51]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,ability,absence,absolute,absurd,abundance,abuse,ac,accept,access,accessibility,...,youngsters,yr,yuri,zack,zara,zero,zilch,zircon,zone,zoo
31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10441,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10458,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10484,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10514,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
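# Create the gensim corpus. Sparse2Corpus treats columns as documents by default,
# so the document-term matrix is transposed to terms x documents first.
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary (column index -> term)
id2wordn = {idx: term for term, idx in cvn.vocabulary_.items()}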
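
To make the corpus and vocabulary objects concrete, here is a minimal pure-Python sketch of what they represent. The toy vocabulary and counts are invented for illustration and are not taken from the review data; the real cell relies on gensim's `Sparse2Corpus` rather than this hand-rolled conversion.

```python
# Toy stand-in for cvn.vocabulary_ (term -> column index); values are made up.
vocabulary = {"hotel": 0, "pool": 1, "room": 2}

# Toy document-term matrix: one row per review, one column per term.
dtm = [
    [2, 0, 1],  # review A: "hotel" x2, "room" x1
    [0, 3, 0],  # review B: "pool" x3
]

# Invert the mapping so that a column index looks up its term (id2word).
id2word = {idx: term for term, idx in vocabulary.items()}

# Each document becomes a sparse list of (term_id, count) pairs,
# i.e. a bag-of-words representation with zero counts dropped.
corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in dtm
]

# Show each document with term ids translated back to words.
for doc in corpus:
    print([(id2word[term_id], count) for term_id, count in doc])
```

The inverted dictionary is what lets `print_topics()` show words instead of integer ids.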

In [53]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.050*"hotel" + 0.041*"room" + 0.018*"service" + 0.018*"check" + 0.017*"staff" + 0.013*"time" + 0.013*"pool" + 0.011*"experience" + 0.010*"view" + 0.009*"night"'),
 (1,
  '0.040*"hotel" + 0.028*"room" + 0.023*"pool" + 0.018*"service" + 0.013*"time" + 0.011*"night" + 0.011*"experience" + 0.009*"staff" + 0.008*"place" + 0.008*"star"'),
 (2,
  '0.047*"hotel" + 0.036*"room" + 0.035*"pool" + 0.019*"service" + 0.016*"staff" + 0.014*"people" + 0.013*"rooms" + 0.010*"night" + 0.008*"view" + 0.008*"check"')]
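
One rough way to judge whether these topics are distinct is to measure how many top words they share. The sketch below parses the three topic strings printed above and computes the Jaccard overlap of their top-10 word sets. The helper names `top_words` and `jaccard` are illustrative and not part of gensim, and a high overlap is only a hint that topics are hard to tell apart, not a formal coherence score.

```python
import re
from itertools import combinations

# Topic strings copied from the 3-topic nouns-only output above.
topics = {
    0: '0.050*"hotel" + 0.041*"room" + 0.018*"service" + 0.018*"check" + 0.017*"staff" + 0.013*"time" + 0.013*"pool" + 0.011*"experience" + 0.010*"view" + 0.009*"night"',
    1: '0.040*"hotel" + 0.028*"room" + 0.023*"pool" + 0.018*"service" + 0.013*"time" + 0.011*"night" + 0.011*"experience" + 0.009*"staff" + 0.008*"place" + 0.008*"star"',
    2: '0.047*"hotel" + 0.036*"room" + 0.035*"pool" + 0.019*"service" + 0.016*"staff" + 0.014*"people" + 0.013*"rooms" + 0.010*"night" + 0.008*"view" + 0.008*"check"',
}


def top_words(topic_string):
    """Extract the quoted terms from a print_topics() string."""
    return set(re.findall(r'"([^"]+)"', topic_string))


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


word_sets = {tid: top_words(text) for tid, text in topics.items()}
overlaps = {
    (i, j): jaccard(word_sets[i], word_sets[j])
    for i, j in combinations(sorted(word_sets), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Topics 0 and 1 share eight of their ten top words, which matches the impression that the nouns-only topics are dominated by the same generic hotel vocabulary.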

In [54]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.052*"hotel" + 0.030*"pool" + 0.025*"room" + 0.021*"service" + 0.013*"people" + 0.012*"staff" + 0.011*"experience" + 0.011*"view" + 0.011*"marina" + 0.011*"night"'),
 (1,
  '0.041*"hotel" + 0.025*"room" + 0.020*"staff" + 0.019*"pool" + 0.018*"service" + 0.011*"check" + 0.011*"rooms" + 0.010*"time" + 0.009*"day" + 0.009*"card"'),
 (2,
  '0.056*"room" + 0.052*"hotel" + 0.023*"pool" + 0.020*"service" + 0.015*"check" + 0.015*"staff" + 0.013*"time" + 0.013*"night" + 0.012*"rooms" + 0.011*"view"'),
 (3,
  '0.030*"pool" + 0.019*"hotel" + 0.012*"room" + 0.010*"mbs" + 0.010*"infinity" + 0.009*"people" + 0.008*"view" + 0.008*"night" + 0.008*"stay" + 0.008*"experience"')]

In [55]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.042*"room" + 0.034*"hotel" + 0.019*"pool" + 0.012*"staff" + 0.012*"service" + 0.012*"people" + 0.010*"night" + 0.010*"time" + 0.010*"money" + 0.010*"club"'),
 (1,
  '0.040*"pool" + 0.032*"hotel" + 0.018*"room" + 0.016*"time" + 0.015*"rooms" + 0.014*"service" + 0.014*"check" + 0.012*"staff" + 0.009*"view" + 0.009*"stay"'),
 (2,
  '0.010*"bed" + 0.007*"experience" + 0.007*"bar" + 0.005*"room" + 0.005*"pax" + 0.005*"departure" + 0.004*"floor" + 0.004*"trip" + 0.004*"today" + 0.004*"mattress"'),
 (3,
  '0.059*"room" + 0.046*"hotel" + 0.029*"pool" + 0.021*"service" + 0.014*"check" + 0.013*"night" + 0.013*"view" + 0.013*"staff" + 0.010*"rooms" + 0.009*"time"'),
 (4,
  '0.066*"hotel" + 0.023*"service" + 0.021*"staff" + 0.018*"room" + 0.017*"pool" + 0.012*"experience" + 0.011*"check" + 0.010*"time" + 0.009*"guests" + 0.009*"night"')]
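
Topic 2 in this run looks different from the others: its top words have much smaller weights. A quick way to quantify that is to sum the weights of the ten printed words per topic. The sketch below does this for topics 0 and 2, using strings copied from the output above; the helper `top_word_mass` is a hypothetical name, and treating a low sum as a sign of a diffuse topic is a heuristic assumption rather than an established rule.

```python
import re

# Two topic strings copied from the 5-topic nouns-only output above.
topics = {
    0: '0.042*"room" + 0.034*"hotel" + 0.019*"pool" + 0.012*"staff" + 0.012*"service" + 0.012*"people" + 0.010*"night" + 0.010*"time" + 0.010*"money" + 0.010*"club"',
    2: '0.010*"bed" + 0.007*"experience" + 0.007*"bar" + 0.005*"room" + 0.005*"pax" + 0.005*"departure" + 0.004*"floor" + 0.004*"trip" + 0.004*"today" + 0.004*"mattress"',
}


def top_word_mass(topic_string):
    """Sum the weights that precede each '*' in a print_topics() string."""
    weights = [float(w) for w in re.findall(r'(\d+\.\d+)\*', topic_string)]
    return sum(weights)


masses = {tid: round(top_word_mass(text), 3) for tid, text in topics.items()}
print(masses)
```

Here the top ten words of topic 0 account for roughly 17% of its probability mass, while those of topic 2 account for only about 5.5%, so topic 2 spreads its weight thinly across many rare words.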

In [56]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.034*"staff" + 0.025*"service" + 0.021*"time" + 0.019*"room" + 0.018*"hotel" + 0.012*"customer" + 0.010*"pool" + 0.009*"check" + 0.007*"experience" + 0.006*"mbs"'),
 (1,
  '0.036*"pool" + 0.030*"room" + 0.029*"hotel" + 0.015*"service" + 0.013*"check" + 0.012*"card" + 0.012*"night" + 0.010*"staff" + 0.009*"water" + 0.009*"view"'),
 (2,
  '0.049*"room" + 0.022*"pool" + 0.022*"hotel" + 0.019*"service" + 0.014*"night" + 0.010*"view" + 0.008*"check" + 0.007*"bed" + 0.007*"floor" + 0.007*"staff"'),
 (3,
  '0.010*"pool" + 0.009*"staffs" + 0.008*"floor" + 0.008*"mbs" + 0.008*"day" + 0.007*"staff" + 0.006*"valet" + 0.005*"time" + 0.005*"manager" + 0.005*"hotel"'),
 (4,
  '0.060*"hotel" + 0.029*"room" + 0.024*"pool" + 0.019*"service" + 0.017*"staff" + 0.015*"people" + 0.013*"check" + 0.011*"view" + 0.010*"guests" + 0.010*"time"'),
 (5,
  '0.053*"hotel" + 0.051*"room" + 0.024*"pool" + 0.019*"service" + 0.014*"rooms" + 0.014*"staff" + 0.012*"time" + 0.011*"night" + 0.011*"experience" + 0.

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [57]:
data_nouns_adj = pd.DataFrame(df_bad_reviews.cleaned_review.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,cleaned_review
31,arrived tried check room reservations long lin...
33,deluxe twin room low floor level level duxe tw...
35,way expensive money waste place buffet horribl...
42,worth money iconic service outstanding rooms n...
53,second time wont second time marina sands firs...
...,...
10441,classy worth money marina bay hotel icon singa...
10458,service hotel finest hotel singapore case repu...
10484,worst star hotel stayed hotel business event r...
10514,bad honeymoon stay hotel dream year hotel big ...


In [58]:
# Create a new document-term matrix using only nouns and adjectives.
# max_df=0.8 ignores terms that appear in more than 80% of the documents,
# which removes overly frequent words (so-called "corpus-specific stop words").
cvna = CountVectorizer(max_df=0.8)
data_cvna = cvna.fit_transform(data_nouns_adj.cleaned_review)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,ab,abide,ability,able,abrasive,abrupt,absence,absolute,absurd,abundance,...,youngsters,yr,yuri,zack,zara,zero,zilch,zircon,zone,zoo
31,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10441,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10458,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10484,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10514,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
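
The effect of `max_df` can be illustrated without scikit-learn. The sketch below is a simplified re-implementation of the document-frequency filter on toy tokenised reviews that are invented for illustration; `filter_by_max_df` is a hypothetical helper, not a scikit-learn function.

```python
# Toy tokenised reviews, invented for illustration.
docs = [
    ["hotel", "pool", "view"],
    ["hotel", "room", "card"],
    ["hotel", "pool", "staff"],
    ["hotel", "room", "view"],
    ["hotel", "staff", "check"],
]


def filter_by_max_df(documents, max_df):
    """Keep terms whose share of documents does not exceed max_df."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each term at most once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)


kept = filter_by_max_df(docs, max_df=0.8)
print(kept)
```

In this toy corpus "hotel" occurs in every document, so it is dropped, while the less common terms survive. In the real data a word is only removed if it appears in more than 80% of the negative reviews, so frequent words below that threshold still reach the model.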


In [59]:
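# Create the gensim corpus. Sparse2Corpus treats columns as documents by default,
# so the document-term matrix is transposed to terms x documents first.
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary (column index -> term)
id2wordna = {idx: term for term, idx in cvna.vocabulary_.items()}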

In [60]:
# Let's start with 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.039*"hotel" + 0.020*"room" + 0.019*"pool" + 0.013*"service" + 0.010*"staff" + 0.010*"rooms" + 0.008*"view" + 0.008*"people" + 0.008*"check" + 0.008*"stay"'),
 (1,
  '0.031*"hotel" + 0.028*"room" + 0.019*"pool" + 0.016*"service" + 0.010*"staff" + 0.009*"check" + 0.009*"time" + 0.009*"night" + 0.007*"stay" + 0.007*"star"'),
 (2,
  '0.031*"room" + 0.026*"hotel" + 0.018*"pool" + 0.013*"staff" + 0.008*"service" + 0.008*"card" + 0.007*"key" + 0.007*"check" + 0.007*"time" + 0.005*"stay"')]

In [61]:
# Now try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.028*"hotel" + 0.019*"room" + 0.016*"staff" + 0.011*"service" + 0.011*"pool" + 0.010*"check" + 0.009*"rooms" + 0.008*"time" + 0.006*"stay" + 0.006*"mbs"'),
 (1,
  '0.041*"hotel" + 0.030*"room" + 0.026*"pool" + 0.015*"service" + 0.009*"staff" + 0.009*"view" + 0.008*"stay" + 0.008*"experience" + 0.008*"check" + 0.008*"people"'),
 (2,
  '0.015*"hotel" + 0.011*"room" + 0.009*"check" + 0.008*"mbs" + 0.007*"experience" + 0.006*"staff" + 0.006*"night" + 0.005*"stay" + 0.005*"time" + 0.005*"line"'),
 (3,
  '0.025*"room" + 0.016*"hotel" + 0.013*"service" + 0.012*"night" + 0.010*"pool" + 0.008*"staff" + 0.006*"time" + 0.006*"guests" + 0.006*"poor" + 0.005*"card"')]

In [62]:
# Now try 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.027*"hotel" + 0.018*"service" + 0.017*"pool" + 0.016*"room" + 0.012*"staff" + 0.010*"time" + 0.008*"check" + 0.008*"night" + 0.006*"good" + 0.006*"rooms"'),
 (1,
  '0.043*"hotel" + 0.025*"pool" + 0.023*"room" + 0.013*"service" + 0.013*"staff" + 0.011*"stay" + 0.010*"night" + 0.009*"people" + 0.009*"rooms" + 0.008*"check"'),
 (2,
  '0.037*"room" + 0.027*"hotel" + 0.013*"service" + 0.010*"pool" + 0.010*"staff" + 0.007*"time" + 0.006*"rooms" + 0.006*"guests" + 0.006*"bay" + 0.006*"mbs"'),
 (3,
  '0.036*"room" + 0.020*"hotel" + 0.018*"pool" + 0.009*"check" + 0.009*"service" + 0.008*"experience" + 0.007*"time" + 0.007*"view" + 0.007*"rooms" + 0.007*"mbs"'),
 (4,
  '0.031*"hotel" + 0.020*"room" + 0.018*"pool" + 0.012*"service" + 0.011*"experience" + 0.010*"check" + 0.008*"staff" + 0.008*"time" + 0.007*"card" + 0.006*"star"')]

In [63]:
# Now try 6 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=6, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.028*"hotel" + 0.026*"pool" + 0.020*"room" + 0.013*"rooms" + 0.013*"people" + 0.012*"staff" + 0.009*"service" + 0.009*"great" + 0.008*"experience" + 0.007*"good"'),
 (1,
  '0.004*"sunday" + 0.004*"food" + 0.004*"cigarett" + 0.004*"weekend" + 0.003*"places" + 0.003*"front" + 0.003*"departure" + 0.003*"pre" + 0.003*"talk" + 0.003*"software"'),
 (2,
  '0.036*"hotel" + 0.021*"pool" + 0.021*"room" + 0.014*"service" + 0.008*"rooms" + 0.007*"breakfast" + 0.007*"good" + 0.006*"view" + 0.006*"price" + 0.006*"night"'),
 (3,
  '0.049*"hotel" + 0.030*"room" + 0.018*"pool" + 0.017*"service" + 0.012*"staff" + 0.011*"check" + 0.010*"time" + 0.009*"night" + 0.008*"stay" + 0.006*"star"'),
 (4,
  '0.028*"room" + 0.021*"hotel" + 0.021*"pool" + 0.012*"staff" + 0.011*"service" + 0.009*"time" + 0.009*"check" + 0.009*"stay" + 0.009*"view" + 0.008*"floor"'),
 (5,
  '0.023*"room" + 0.021*"hotel" + 0.013*"service" + 0.011*"pool" + 0.010*"staff" + 0.009*"night" + 0.008*"star" + 0.008*"experience" + 0.00

## Identify Topics in Best Models

to