
# **Topic Modeling**
## **Introduction**






Another popular text analysis technique is called topic modeling. The ultimate goal of topic modeling is to find various topics that are present in your corpus. Each document in the corpus will be made up of at least one topic, if not multiple topics.

This notebook walks through **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques and one that was designed specifically for text data.

To use a topic modeling technique, you need to provide:

> (1) a document-term matrix, and

> (2) the number of topics you would like the algorithm to pick up.






Once the topic modeling technique is applied, your job as a human is to interpret the results and see if the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.
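To make the first input concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The three reviews are made up for illustration, and the notebook itself uses scikit-learn's `CountVectorizer` rather than this hand-rolled loop.

```python
from collections import Counter

# Three made-up, already-cleaned reviews (hypothetical data for illustration).
docs = [
    "pool view amazing",
    "room service slow room",
    "pool room view",
]

# Vocabulary: every distinct term, sorted so the column order is stable.
vocab = sorted({term for doc in docs for term in doc.split()})

# Document-term matrix: one row per document, one column per term,
# each cell holding how often the term occurs in that document.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Each row is one review and each column is one term, so the second row records that `room` occurs twice in the second review.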


**Note:** LDA can take a very long time to run.

**Importing the data**

In [1]:
import pandas as pd
import numpy as np
from pyprojroot import here

# Data Cleaning

In [2]:
# Load the dataset
df = pd.read_csv(here('data/cleaned/cleaned_mbs_total.csv'))
df.head(3)

Unnamed: 0.1,Unnamed: 0,date_of_stay,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating1,rating2,rating,valid_rating,label,cleaned_review,combined_review,date,covid
0,0,Date of stay: August 2023,Erica G,"Sick in Singapore, and MBS staff were amazing!","I was in Singapore on business and, unfortunat...",Trip type: Travelled on business,"Arlington Heights, Illinois",105.0,62.0,5.0,,5.0,True,Positive,sick singapore mbs staff amazing I singapore b...,"Sick in Singapore, and MBS staff were amazing!...",2023-08-01,PostCovid
1,1,Date of stay: April 2023,HJay,Lovely place to go whatever you plan to do!,Whether it’s to soak up the Marina Bay citysca...,,"Perth, Australia",14.0,11.0,,,,False,,lovely place go whatever plan whether soak mar...,Lovely place to go whatever you plan to do! Wh...,2023-04-01,PostCovid
2,2,Date of stay: September 2023,TaM,Thank you for the unforgettable memories,I stayed at Marina Bay Sands to propose to my ...,Trip type: Travelled as a couple,,1.0,,5.0,,5.0,True,Positive,thank unforgettable memory I stay marina bay s...,Thank you for the unforgettable memories I sta...,2023-09-01,PostCovid


In [3]:
df[df.traveller_username == 'Jayne Jeong'].review_text.values

array([' As a remarkable landmark of Singapore, I recommend this hotel. As you know, it is known for its great infinity pool and definitely it is wonderful. And also the staffs are kind and professional. There are several Korean staffs and they are ready to help you. That is why I tried this hotel twice - first was 2016 and this time. A room condition was okay and met my expectation.   * 수영장 하나만으로도 값어치 한다고 생각합니다. 인피니티 풀에서 보는 뷰는 낮/밤 모두 환상적이니 꼭 챙겨보세요 :) '],
      dtype=object)

In [4]:
df.isnull().sum()

Unnamed: 0                                   0
date_of_stay                               549
traveller_username                           0
review_title                                 1
review_text                                  0
travel_type                               7173
traveller_country_origin                  2134
traveller_total_contributions                6
traveller_total_helpful_contributions     1316
rating1                                   4628
rating2                                  14744
rating                                       1
valid_rating                                 0
label                                        1
cleaned_review                               0
combined_review                              0
date                                       549
covid                                      549
dtype: int64

### Create document-term matrix


In [5]:
df.shape

(19371, 18)

In [6]:
import nltk
from nltk.corpus import words
from nltk.corpus import stopwords

# Download the NLTK stop word list if you haven't already
nltk.download('stopwords')

# Download the NLTK words corpus if you haven't already
nltk.download('words')

# Get the list of valid English words
english_words = set(words.words())

# set stopwords
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [7]:
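def remove_stopwords(text):
    '''Drop tokens that appear in the NLTK English stop word list.'''
    tokens = text.split()  # Split the text on whitespace
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
    return ' '.join(filtered_tokens)

def remove_non_english_words(text, valid_words):
    '''Keep only tokens found in valid_words (here, the NLTK English word list).'''
    tokens = text.split()  # Named tokens to avoid shadowing the imported `words` corpus
    filtered_tokens = [token for token in tokens if token.lower() in valid_words]
    return ' '.join(filtered_tokens)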

In [8]:
df['cleaned_review'] = df['cleaned_review'].apply(remove_stopwords)
df['cleaned_review'] = df['cleaned_review'].apply(remove_non_english_words, valid_words=english_words)
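To see what these two filters do to a single review, here is a small sketch that uses hypothetical miniature word lists in place of the NLTK corpora. The helper `keep_tokens` and both toy sets are assumptions made for illustration; the input string is modeled on the first review's `cleaned_review` value.

```python
# Hypothetical miniature word lists standing in for the NLTK corpora.
toy_stop_words = {"i", "the", "in", "was"}
toy_english_words = {"sick", "staff", "amazing", "business"}

def keep_tokens(text, predicate):
    """Return text with only the whitespace-separated tokens that satisfy predicate."""
    return " ".join(tok for tok in text.split() if predicate(tok.lower()))

raw = "sick singapore mbs staff amazing I singapore business"

# Step 1: drop stop words. Step 2: drop anything not in the English word list.
no_stops = keep_tokens(raw, lambda tok: tok not in toy_stop_words)
english_only = keep_tokens(no_stops, lambda tok: tok in toy_english_words)

print(no_stops)
print(english_only)
```

Note the side effect: place names and abbreviations such as `singapore` and `mbs` are dropped because they are not dictionary words. The same thing is visible in the real data, where the first review's cleaned text goes from `sick singapore mbs staff amazing ...` to `sick staff amazing business ...`.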

# Topic Modeling with Good reviews (rating >= 4)

In [9]:
df_good_reviews = df.loc[df.rating >= 4]
df_good_reviews.shape

(15508, 18)

In [10]:
# Create document-term matrix
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform(df_good_reviews.cleaned_review)

In [11]:
data_cv.shape

(15508, 9912)

In [12]:
df_good_reviews.head()

Unnamed: 0.1,Unnamed: 0,date_of_stay,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating1,rating2,rating,valid_rating,label,cleaned_review,combined_review,date,covid
0,0,Date of stay: August 2023,Erica G,"Sick in Singapore, and MBS staff were amazing!","I was in Singapore on business and, unfortunat...",Trip type: Travelled on business,"Arlington Heights, Illinois",105.0,62.0,5.0,,5.0,True,Positive,sick staff amazing business unfortunately end ...,"Sick in Singapore, and MBS staff were amazing!...",2023-08-01,PostCovid
2,2,Date of stay: September 2023,TaM,Thank you for the unforgettable memories,I stayed at Marina Bay Sands to propose to my ...,Trip type: Travelled as a couple,,1.0,,5.0,,5.0,True,Positive,thank unforgettable memory stay marina bay san...,Thank you for the unforgettable memories I sta...,2023-09-01,PostCovid
3,3,Date of stay: September 2023,Praxmeyer,Amazing hotel but not sure I’d do it again,We stayed one night in the hotel. The good par...,Trip type: Travelled as a couple,"Napier, New Zealand",444.0,232.0,4.0,,4.0,True,Positive,amazing hotel sure stay one night hotel good p...,Amazing hotel but not sure I’d do it again We ...,2023-09-01,PostCovid
4,4,Date of stay: August 2023,TravelWriter74,Stunning hotel and overall an amazing experience,We have just returned from a super couple of d...,Trip type: Travelled with family,"London, United Kingdom",166.0,114.0,,5.0,5.0,True,Positive,stunning hotel overall amazing experience retu...,Stunning hotel and overall an amazing experien...,2023-08-01,PostCovid
5,5,Date of stay: September 2023,Ingo S,Perfect stay!,One of the best hotels we stayed! Great enviro...,Trip type: Travelled as a couple,"Ilmenau, Germany",3.0,,5.0,,5.0,True,Positive,perfect stay one good hotel stay great environ...,Perfect stay! One of the best hotels we stayed...,2023-09-01,PostCovid


## Topic Modeling - Attempt #1 (Complete Review Text)

In [13]:
data = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data.index = df_good_reviews.index
data

Unnamed: 0,aa,aback,abandon,abandonment,abas,abbreviate,abhor,ability,able,abnormal,...,yuck,yummy,zee,zenith,zero,zig,zip,zone,zoo,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19339,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19348,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
19350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
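The `toarray()` call above materializes the full matrix in memory, which may be part of why this notebook feels heavy. As a rough back-of-the-envelope check, the sketch below estimates the size of the dense array from the shape reported earlier. The 8 bytes per cell is an assumption based on `CountVectorizer` producing 64-bit integer counts by default.

```python
n_docs, n_terms = 15508, 9912  # data_cv.shape reported earlier
bytes_per_cell = 8             # assumption: 64-bit integer counts

# Every cell is stored in a dense array, including the many zeros.
dense_bytes = n_docs * n_terms * bytes_per_cell
print(f"{dense_bytes / 1e9:.2f} GB")  # prints "1.23 GB"
```

If memory becomes a problem, one option worth checking is gensim's `Sparse2Corpus`, which takes a `documents_columns` argument, so it may be possible to pass the sparse `data_cv` directly with `documents_columns=False` instead of going through a dense DataFrame.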


In [14]:
# Import the necessary modules for LDA with gensim
from gensim import matutils, models
import scipy.sparse

#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [15]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,0,2,3,4,5,6,7,9,10,11,...,19323,19325,19329,19335,19336,19339,19348,19350,19358,19367
aa,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
aback,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandonment,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abas,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)
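It helps to know what a gensim-style corpus looks like: each document becomes a list of `(term_id, count)` pairs, with zero counts left out. The sketch below builds that structure by hand from a made-up term-document matrix; it illustrates the format only and is not gensim's own implementation.

```python
# Toy term-document matrix: rows are terms, columns are documents (made-up counts).
terms = ["pool", "room", "view"]
tdm_toy = [
    [1, 0, 1],  # pool
    [0, 2, 1],  # room
    [1, 0, 1],  # view
]

# For each document (column), keep only the non-zero (term_id, count) pairs.
n_docs_toy = len(tdm_toy[0])
bow_corpus = [
    [(term_id, row[doc]) for term_id, row in enumerate(tdm_toy) if row[doc] != 0]
    for doc in range(n_docs_toy)
]

print(bow_corpus)
```

This is why the matrix is transposed first: with terms as rows, each column is read as one document.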

In [17]:
# Gensim also requires a dictionary that maps each term's position in the term-document matrix back to the term itself
id2word = dict((v, k) for k, v in cv.vocabulary_.items())
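The one-liner above inverts the vectorizer's vocabulary. A tiny sketch with a hypothetical vocabulary shows the direction of the mapping before and after:

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_: term -> column index.
vocabulary = {"pool": 0, "room": 1, "view": 2}

# Invert it to column index -> term, the direction used for id2word.
id2word_toy = {idx: term for term, idx in vocabulary.items()}

print(id2word_toy)
```

The model only sees integer ids, so this lookup is what lets `print_topics()` show words instead of column numbers.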

In [18]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.042*"hotel" + 0.022*"pool" + 0.018*"room" + 0.012*"good" + 0.011*"view" + 0.010*"stay" + 0.009*"get" + 0.009*"one" + 0.008*"like" + 0.008*"great"'),
 (1,
  '0.037*"view" + 0.036*"hotel" + 0.033*"pool" + 0.032*"stay" + 0.024*"room" + 0.022*"bay" + 0.021*"amazing" + 0.018*"great" + 0.017*"marina" + 0.017*"good"'),
 (2,
  '0.038*"room" + 0.023*"check" + 0.014*"stay" + 0.012*"service" + 0.012*"staff" + 0.011*"hotel" + 0.010*"club" + 0.010*"get" + 0.009*"time" + 0.007*"pool"')]
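Reading these strings by eye, the three topics share many words. One way to put a number on that is sketched below: pull the quoted words out of each topic string and compute the Jaccard overlap between every pair. The topic strings are copied from the 3-topic output above; `top_words` is a hypothetical helper, and Jaccard overlap is just one of several possible similarity measures.

```python
import re
from itertools import combinations

# Topic strings copied from the 3-topic output above.
topic_strings = {
    0: '0.042*"hotel" + 0.022*"pool" + 0.018*"room" + 0.012*"good" + 0.011*"view" + 0.010*"stay" + 0.009*"get" + 0.009*"one" + 0.008*"like" + 0.008*"great"',
    1: '0.037*"view" + 0.036*"hotel" + 0.033*"pool" + 0.032*"stay" + 0.024*"room" + 0.022*"bay" + 0.021*"amazing" + 0.018*"great" + 0.017*"marina" + 0.017*"good"',
    2: '0.038*"room" + 0.023*"check" + 0.014*"stay" + 0.012*"service" + 0.012*"staff" + 0.011*"hotel" + 0.010*"club" + 0.010*"get" + 0.009*"time" + 0.007*"pool"',
}

def top_words(topic_string):
    """Extract the quoted terms from a gensim-style topic string."""
    return set(re.findall(r'"([^"]+)"', topic_string))

words_by_topic = {tid: top_words(s) for tid, s in topic_strings.items()}

# Jaccard overlap = shared words / all distinct words across the two topics.
overlaps = {}
for a, b in combinations(sorted(words_by_topic), 2):
    shared = words_by_topic[a] & words_by_topic[b]
    union = words_by_topic[a] | words_by_topic[b]
    overlaps[(a, b)] = len(shared) / len(union)
    print(f"topics {a} and {b}: {len(shared)} shared words, Jaccard {overlaps[(a, b)]:.2f}")
```

Topics 0 and 1 share 7 of their top 10 words, which matches the impression that the complete-text model has trouble separating themes and motivates the noun-only attempts that follow.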

In [19]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.031*"room" + 0.022*"stay" + 0.022*"check" + 0.020*"service" + 0.019*"staff" + 0.017*"hotel" + 0.012*"club" + 0.010*"time" + 0.010*"experience" + 0.009*"make"'),
 (1,
  '0.050*"room" + 0.036*"pool" + 0.035*"view" + 0.020*"hotel" + 0.020*"stay" + 0.018*"great" + 0.014*"good" + 0.014*"night" + 0.014*"floor" + 0.012*"breakfast"'),
 (2,
  '0.024*"hotel" + 0.015*"room" + 0.012*"get" + 0.011*"check" + 0.009*"one" + 0.009*"take" + 0.008*"go" + 0.008*"tower" + 0.008*"guest" + 0.007*"like"'),
 (3,
  '0.050*"hotel" + 0.028*"pool" + 0.026*"stay" + 0.024*"view" + 0.020*"bay" + 0.018*"good" + 0.016*"marina" + 0.015*"amazing" + 0.014*"great" + 0.013*"room"')]

In [20]:
# LDA for num_topics = 5
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10)
lda.print_topics()

[(0,
  '0.049*"view" + 0.039*"pool" + 0.033*"room" + 0.023*"stay" + 0.021*"night" + 0.020*"floor" + 0.019*"hotel" + 0.017*"amazing" + 0.017*"city" + 0.016*"great"'),
 (1,
  '0.036*"room" + 0.022*"check" + 0.014*"get" + 0.012*"hotel" + 0.009*"take" + 0.008*"go" + 0.008*"guest" + 0.007*"time" + 0.007*"one" + 0.007*"bed"'),
 (2,
  '0.057*"hotel" + 0.034*"pool" + 0.033*"stay" + 0.026*"room" + 0.023*"good" + 0.018*"view" + 0.017*"great" + 0.015*"service" + 0.015*"experience" + 0.014*"amazing"'),
 (3,
  '0.044*"hotel" + 0.025*"bay" + 0.021*"marina" + 0.020*"shopping" + 0.019*"casino" + 0.017*"mall" + 0.015*"restaurant" + 0.013*"good" + 0.011*"food" + 0.011*"shop"'),
 (4,
  '0.030*"room" + 0.022*"staff" + 0.022*"service" + 0.021*"stay" + 0.020*"club" + 0.020*"check" + 0.011*"hotel" + 0.011*"make" + 0.010*"suite" + 0.008*"time"')]

In [21]:
# LDA for num_topics = 6
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=10)
lda.print_topics()

[(0,
  '0.049*"room" + 0.033*"bed" + 0.021*"bathroom" + 0.016*"view" + 0.013*"shower" + 0.011*"bath" + 0.011*"de" + 0.010*"ta" + 0.009*"good" + 0.009*"floor"'),
 (1,
  '0.044*"hotel" + 0.019*"shopping" + 0.017*"bay" + 0.016*"mall" + 0.016*"casino" + 0.015*"restaurant" + 0.013*"view" + 0.010*"good" + 0.010*"shop" + 0.010*"marina"'),
 (2,
  '0.047*"room" + 0.046*"hotel" + 0.044*"pool" + 0.043*"view" + 0.035*"stay" + 0.033*"great" + 0.027*"good" + 0.021*"amazing" + 0.018*"service" + 0.014*"nice"'),
 (3,
  '0.025*"room" + 0.025*"service" + 0.024*"staff" + 0.021*"stay" + 0.017*"check" + 0.016*"hotel" + 0.011*"make" + 0.010*"experience" + 0.009*"suite" + 0.008*"time"'),
 (4,
  '0.033*"hotel" + 0.030*"pool" + 0.030*"stay" + 0.028*"bay" + 0.026*"view" + 0.025*"marina" + 0.019*"amazing" + 0.018*"night" + 0.016*"sand" + 0.015*"one"'),
 (5,
  '0.032*"room" + 0.020*"hotel" + 0.020*"check" + 0.018*"pool" + 0.017*"get" + 0.011*"go" + 0.009*"stay" + 0.009*"time" + 0.009*"take" + 0.008*"one"')]

## Topic Modeling - Attempt #2 (Nouns Only)

In [22]:
import string
import nltk
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [23]:
# Let's create a function to pull out nouns from a string of text
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    is_noun = lambda pos: pos[:2] == 'NN'
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if is_noun(pos)]
    return ' '.join(all_nouns)
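The `pos[:2] == 'NN'` test works because every noun tag in the Penn Treebank tag set starts with `NN` (`NN`, `NNS`, `NNP`, `NNPS`). The sketch below applies the same prefix check to hand-written `(word, tag)` pairs, so it runs without NLTK. The tags are assigned by hand for illustration and are not the output of `pos_tag`.

```python
# Hand-written (word, tag) pairs using Penn Treebank tags (illustrative, not pos_tag output).
tagged = [
    ("staff", "NNS"),
    ("amazing", "JJ"),
    ("pool", "NN"),
    ("stay", "VB"),
    ("marina", "NNP"),
    ("great", "JJ"),
]

# Same prefix test as in nouns(): keep any tag that begins with "NN".
nouns_only = [word for word, pos in tagged if pos[:2] == "NN"]

print(" ".join(nouns_only))
```

Adjectives (`JJ`) and verbs (`VB`) are filtered out, leaving `staff pool marina`.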

In [24]:
data_nouns = pd.DataFrame(df_good_reviews.cleaned_review.apply(nouns))
data_nouns

Unnamed: 0,cleaned_review
0,staff business stay room day result intimate r...
2,thank memory marina bay sand communicate staff...
3,hotel night hotel part staff room view city co...
4,hotel experience return couple day travel week...
5,hotel stay environment service room kind addit...
...,...
19339,sky park view view today moment walk photo poo...
19348,pool bare room leisure hotel reason pool view ...
19350,luxury seat area check room compare hotel look...
19358,hotel service hotel minute airport minute sin ...


In [25]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aback,abandon,abhor,ability,aboard,abound,absence,absent,absolute,absorb,...,yuan,yuck,yummy,zee,zenith,zero,zip,zone,zoo,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19339,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19348,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [27]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.084*"room" + 0.032*"check" + 0.022*"hotel" + 0.017*"club" + 0.016*"view" + 0.015*"time" + 0.015*"pool" + 0.015*"floor" + 0.013*"service" + 0.013*"staff"'),
 (1,
  '0.070*"hotel" + 0.040*"service" + 0.034*"pool" + 0.034*"staff" + 0.034*"stay" + 0.033*"room" + 0.031*"experience" + 0.021*"bay" + 0.021*"marina" + 0.018*"sand"'),
 (2,
  '0.074*"hotel" + 0.053*"pool" + 0.053*"view" + 0.036*"room" + 0.021*"bay" + 0.019*"city" + 0.019*"night" + 0.018*"floor" + 0.016*"infinity" + 0.014*"place"')]

In [28]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.083*"room" + 0.025*"check" + 0.025*"hotel" + 0.023*"view" + 0.019*"floor" + 0.017*"pool" + 0.013*"bathroom" + 0.013*"area" + 0.013*"time" + 0.011*"tower"'),
 (1,
  '0.049*"room" + 0.040*"service" + 0.036*"club" + 0.035*"staff" + 0.034*"hotel" + 0.021*"stay" + 0.019*"check" + 0.017*"time" + 0.016*"experience" + 0.012*"book"'),
 (2,
  '0.084*"pool" + 0.073*"hotel" + 0.066*"view" + 0.063*"room" + 0.036*"night" + 0.031*"stay" + 0.024*"infinity" + 0.023*"experience" + 0.021*"city" + 0.018*"floor"'),
 (3,
  '0.081*"hotel" + 0.038*"bay" + 0.031*"marina" + 0.027*"pool" + 0.026*"view" + 0.021*"casino" + 0.021*"sand" + 0.019*"restaurant" + 0.017*"place" + 0.017*"food"')]

In [29]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.057*"hotel" + 0.047*"view" + 0.037*"pool" + 0.030*"room" + 0.020*"floor" + 0.020*"food" + 0.020*"restaurant" + 0.018*"city" + 0.016*"night" + 0.016*"garden"'),
 (1,
  '0.082*"room" + 0.031*"check" + 0.022*"hotel" + 0.021*"club" + 0.016*"view" + 0.016*"time" + 0.015*"staff" + 0.015*"service" + 0.015*"pool" + 0.013*"floor"'),
 (2,
  '0.178*"bay" + 0.174*"marina" + 0.125*"sand" + 0.019*"visit" + 0.015*"place" + 0.011*"sky" + 0.011*"park" + 0.009*"deck" + 0.008*"time" + 0.008*"world"'),
 (3,
  '0.068*"hotel" + 0.057*"service" + 0.043*"staff" + 0.035*"experience" + 0.031*"stay" + 0.020*"time" + 0.017*"business" + 0.016*"family" + 0.014*"property" + 0.013*"pool"'),
 (4,
  '0.109*"hotel" + 0.081*"pool" + 0.061*"room" + 0.057*"view" + 0.031*"stay" + 0.030*"night" + 0.023*"experience" + 0.023*"infinity" + 0.020*"service" + 0.018*"city"')]

In [30]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.055*"room" + 0.046*"club" + 0.040*"hotel" + 0.034*"view" + 0.032*"pool" + 0.028*"night" + 0.024*"breakfast" + 0.023*"stay" + 0.018*"food" + 0.016*"floor"'),
 (1,
  '0.103*"hotel" + 0.047*"pool" + 0.032*"view" + 0.027*"room" + 0.019*"casino" + 0.019*"place" + 0.018*"restaurant" + 0.017*"food" + 0.014*"infinity" + 0.014*"service"'),
 (2,
  '0.127*"marina" + 0.119*"bay" + 0.092*"sand" + 0.045*"hotel" + 0.038*"experience" + 0.031*"stay" + 0.027*"staff" + 0.027*"service" + 0.020*"place" + 0.017*"time"'),
 (3,
  '0.089*"room" + 0.069*"view" + 0.062*"pool" + 0.033*"hotel" + 0.031*"night" + 0.031*"floor" + 0.026*"city" + 0.019*"stay" + 0.018*"infinity" + 0.016*"garden"'),
 (4,
  '0.067*"room" + 0.044*"check" + 0.036*"hotel" + 0.032*"service" + 0.026*"staff" + 0.021*"time" + 0.010*"stay" + 0.010*"book" + 0.009*"day" + 0.009*"people"'),
 (5,
  '0.041*"birthday" + 0.031*"property" + 0.030*"business" + 0.025*"suite" + 0.021*"year" + 0.018*"family" + 0.017*"weekend" + 0.017*"cake" + 0.016

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [31]:
# Let's create a function to pull out nouns and adjectives from a string of text
def nouns_adj(text):
    '''Given a string of text, tokenize the text and pull out only the nouns and adjectives.'''
    is_noun_adj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nouns_adj = [word for (word, pos) in pos_tag(tokenized) if is_noun_adj(pos)]
    return ' '.join(nouns_adj)

In [32]:
data_nouns_adj = pd.DataFrame(df_good_reviews.cleaned_review.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,cleaned_review
0,sick staff business end stay room several day ...
2,thank unforgettable memory stay marina bay san...
3,amazing hotel sure night hotel good part staff...
4,hotel overall amazing experience return super ...
5,perfect good hotel stay great environment grea...
...,...
19339,sky park top view nice view today top moment o...
19348,awesome pool bare room leisure guest stay hote...
19350,luxury comfortable seat area check room big co...
19358,outstanding hotel exceptional service new hote...


In [33]:
# Create a new document-term matrix using only nouns and adjectives, and remove very common words with max_df.
# max_df drops terms that appear in too many documents, also known as "corpus-specific stop words".
# For example, max_df=.8 ignores terms that appear in more than 80% of the documents.
cvna = CountVectorizer(max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.cleaned_review)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aa,aback,abandon,abandonment,abas,abbreviate,abhor,ability,able,abnormal,...,yuan,yuck,yummy,zee,zenith,zero,zip,zone,zoo,zoom
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19339,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19348,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
19350,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19358,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
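To see how the `max_df` cutoff behaves, here is a standard-library sketch of the same rule on five made-up documents: count how many documents each term appears in, then keep only terms whose document frequency is at most `max_df` times the number of documents. This mirrors the documented behavior but is not scikit-learn's code.

```python
from collections import Counter

# Five made-up documents (hypothetical data for illustration).
docs_toy = [
    "hotel pool view",
    "hotel room service",
    "hotel pool room",
    "hotel pool staff",
    "hotel pool staff view",
]
max_df = 0.8

# Document frequency: in how many documents does each term appear at least once?
doc_freq = Counter(term for doc in docs_toy for term in set(doc.split()))

# Keep terms that appear in at most 80% of the documents (here, at most 4 of 5).
threshold = max_df * len(docs_toy)
kept = sorted(term for term, df in doc_freq.items() if df <= threshold)
dropped = sorted(term for term, df in doc_freq.items() if df > threshold)

print("kept:", kept)
print("dropped:", dropped)
```

Here `hotel` (5 of 5 documents) is dropped while `pool` (4 of 5) survives. In the real results that follow, `hotel` still appears among the top topic words, which means it occurs in no more than 80% of the good reviews, so a lower `max_df` would be needed to remove it.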


In [34]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [35]:
# Let's start with 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.048*"room" + 0.022*"check" + 0.016*"hotel" + 0.015*"service" + 0.014*"staff" + 0.012*"time" + 0.010*"stay" + 0.010*"club" + 0.008*"book" + 0.007*"guest"'),
 (1,
  '0.050*"hotel" + 0.048*"pool" + 0.044*"room" + 0.042*"view" + 0.023*"great" + 0.020*"good" + 0.020*"stay" + 0.018*"night" + 0.015*"floor" + 0.014*"city"'),
 (2,
  '0.048*"hotel" + 0.037*"bay" + 0.036*"marina" + 0.023*"sand" + 0.016*"good" + 0.016*"stay" + 0.015*"place" + 0.014*"experience" + 0.012*"pool" + 0.011*"service"')]

In [36]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.038*"room" + 0.028*"staff" + 0.028*"service" + 0.023*"check" + 0.021*"hotel" + 0.019*"stay" + 0.014*"time" + 0.011*"experience" + 0.009*"club" + 0.009*"great"'),
 (1,
  '0.079*"bay" + 0.059*"marina" + 0.040*"view" + 0.037*"sand" + 0.023*"pool" + 0.020*"garden" + 0.018*"night" + 0.015*"hotel" + 0.014*"city" + 0.013*"visit"'),
 (2,
  '0.077*"hotel" + 0.040*"pool" + 0.031*"good" + 0.029*"view" + 0.029*"room" + 0.029*"great" + 0.025*"stay" + 0.017*"service" + 0.015*"experience" + 0.015*"place"'),
 (3,
  '0.052*"room" + 0.029*"pool" + 0.027*"hotel" + 0.022*"view" + 0.014*"floor" + 0.013*"night" + 0.010*"check" + 0.010*"stay" + 0.010*"club" + 0.009*"city"')]

In [37]:
# Let's try 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.052*"hotel" + 0.050*"pool" + 0.046*"room" + 0.040*"view" + 0.024*"great" + 0.023*"good" + 0.021*"stay" + 0.018*"night" + 0.013*"floor" + 0.013*"city"'),
 (1,
  '0.059*"hotel" + 0.035*"marina" + 0.033*"bay" + 0.032*"stay" + 0.029*"service" + 0.027*"experience" + 0.027*"sand" + 0.023*"good" + 0.023*"pool" + 0.018*"staff"'),
 (2,
  '0.045*"hotel" + 0.030*"bay" + 0.027*"view" + 0.021*"marina" + 0.018*"casino" + 0.017*"pool" + 0.014*"restaurant" + 0.014*"top" + 0.014*"place" + 0.013*"mall"'),
 (3,
  '0.050*"room" + 0.022*"check" + 0.016*"hotel" + 0.011*"time" + 0.009*"guest" + 0.007*"floor" + 0.007*"pool" + 0.007*"tower" + 0.007*"day" + 0.007*"service"'),
 (4,
  '0.052*"room" + 0.051*"club" + 0.038*"staff" + 0.027*"suite" + 0.025*"stay" + 0.025*"service" + 0.018*"view" + 0.017*"great" + 0.017*"upgrade" + 0.014*"check"')]

In [38]:
# Let's try 6 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=6, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.050*"hotel" + 0.033*"view" + 0.027*"pool" + 0.021*"bay" + 0.019*"casino" + 0.017*"good" + 0.016*"great" + 0.016*"restaurant" + 0.016*"top" + 0.015*"mall"'),
 (1,
  '0.052*"room" + 0.029*"club" + 0.023*"floor" + 0.016*"breakfast" + 0.013*"th" + 0.013*"bathroom" + 0.011*"view" + 0.011*"tea" + 0.011*"lounge" + 0.010*"night"'),
 (2,
  '0.050*"hotel" + 0.044*"marina" + 0.042*"bay" + 0.029*"sand" + 0.023*"business" + 0.014*"conference" + 0.013*"property" + 0.011*"place" + 0.010*"world" + 0.010*"many"'),
 (3,
  '0.050*"hotel" + 0.049*"room" + 0.038*"pool" + 0.018*"good" + 0.018*"view" + 0.015*"stay" + 0.013*"great" + 0.013*"people" + 0.011*"nice" + 0.010*"time"'),
 (4,
  '0.036*"room" + 0.030*"check" + 0.027*"staff" + 0.027*"service" + 0.017*"time" + 0.015*"hotel" + 0.014*"stay" + 0.011*"arrive" + 0.009*"book" + 0.008*"desk"'),
 (5,
  '0.049*"hotel" + 0.045*"pool" + 0.045*"room" + 0.043*"view" + 0.035*"stay" + 0.027*"great" + 0.023*"good" + 0.021*"night" + 0.021*"experience" + 0.018

## Identify Topics in Best Models

# Topic Modeling with Bad reviews (rating <= 2)

In [39]:
df_bad_reviews = df.loc[df.rating <= 2]
df_bad_reviews.shape

(1669, 18)
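Before modeling the bad reviews, it is worth noting how much smaller this subset is. The quick arithmetic below uses only the three shapes reported in this notebook; the "neither" bucket covers reviews that fall between the two thresholds or have no rating.

```python
total, good, bad = 19371, 15508, 1669  # row counts from df, df_good_reviews, df_bad_reviews
neither = total - good - bad           # rated between 2 and 4, or rating missing

print(f"good: {good / total:.1%}, bad: {bad / total:.1%}, neither: {neither / total:.1%}")
```

With roughly a tenth as many documents as the good-review corpus, the bad-review topics may be noisier, so they deserve a more cautious reading.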

In [40]:
# Create document-term matrix
cv = CountVectorizer()
data_cv = cv.fit_transform(df_bad_reviews.cleaned_review)

In [41]:
data_cv.shape

(1669, 6317)

## Topic Modeling - Attempt #1 (Complete Review Text)

In [42]:
data = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data.index = df_bad_reviews.index
data

Unnamed: 0,aback,abacus,abandon,abide,ability,able,aboard,abrasive,abroad,abrupt,...,youngster,youth,yr,yuck,zealous,zee,zero,zircon,zone,zoo
31,0,0,0,0,0,2,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19363,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [43]:
# One of the required inputs is a term-document matrix
tdm = data.transpose()
tdm.head()

Unnamed: 0,31,33,35,42,53,84,96,107,108,118,...,19356,19357,19360,19361,19362,19363,19364,19368,19369,19370
aback,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abacus,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abandon,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
abide,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ability,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
# We're going to put the term-document matrix into a new gensim format, from df --> sparse matrix --> gensim corpus
sparse_counts = scipy.sparse.csr_matrix(tdm)
corpus = matutils.Sparse2Corpus(sparse_counts)

In [45]:
# Gensim also requires a dictionary that maps each term's position in the term-document matrix back to the term itself
id2word = dict((v, k) for k, v in cv.vocabulary_.items())

In [46]:
# LDA for num_topics = 3
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=3, passes=10)
lda.print_topics()

[(0,
  '0.019*"hotel" + 0.019*"room" + 0.017*"check" + 0.013*"pool" + 0.012*"stay" + 0.010*"get" + 0.008*"go" + 0.007*"time" + 0.007*"wait" + 0.006*"like"'),
 (1,
  '0.034*"hotel" + 0.028*"room" + 0.016*"stay" + 0.015*"pool" + 0.014*"service" + 0.011*"get" + 0.011*"check" + 0.009*"view" + 0.009*"one" + 0.008*"go"'),
 (2,
  '0.028*"room" + 0.021*"hotel" + 0.017*"check" + 0.013*"staff" + 0.011*"service" + 0.008*"guest" + 0.008*"get" + 0.008*"stay" + 0.008*"go" + 0.008*"call"')]

In [47]:
# LDA for num_topics = 4
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=4, passes=10)
lda.print_topics()

[(0,
  '0.032*"room" + 0.024*"hotel" + 0.018*"check" + 0.014*"stay" + 0.012*"service" + 0.011*"get" + 0.009*"pool" + 0.009*"staff" + 0.008*"go" + 0.008*"time"'),
 (1,
  '0.043*"hotel" + 0.020*"pool" + 0.016*"stay" + 0.013*"service" + 0.010*"like" + 0.009*"go" + 0.009*"guest" + 0.009*"room" + 0.009*"get" + 0.008*"one"'),
 (2,
  '0.023*"hotel" + 0.023*"room" + 0.018*"pool" + 0.010*"get" + 0.010*"staff" + 0.009*"like" + 0.009*"club" + 0.009*"check" + 0.009*"go" + 0.008*"one"'),
 (3,
  '0.031*"room" + 0.026*"hotel" + 0.014*"service" + 0.011*"stay" + 0.010*"check" + 0.009*"charge" + 0.008*"day" + 0.008*"one" + 0.008*"night" + 0.008*"breakfast"')]

In [48]:
# LDA for num_topics = 5
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=5, passes=10)
lda.print_topics()

[(0,
  '0.035*"room" + 0.019*"check" + 0.019*"hotel" + 0.013*"get" + 0.012*"service" + 0.009*"go" + 0.009*"tell" + 0.009*"time" + 0.009*"staff" + 0.009*"stay"'),
 (1,
  '0.024*"room" + 0.012*"staff" + 0.011*"club" + 0.010*"check" + 0.009*"guest" + 0.008*"go" + 0.008*"get" + 0.007*"queue" + 0.007*"pool" + 0.007*"floor"'),
 (2,
  '0.037*"hotel" + 0.024*"room" + 0.021*"pool" + 0.020*"stay" + 0.010*"get" + 0.010*"check" + 0.010*"like" + 0.010*"view" + 0.010*"service" + 0.009*"night"'),
 (3,
  '0.026*"room" + 0.022*"hotel" + 0.013*"check" + 0.011*"pool" + 0.010*"stay" + 0.009*"service" + 0.008*"one" + 0.007*"staff" + 0.007*"make" + 0.006*"get"'),
 (4,
  '0.037*"hotel" + 0.019*"service" + 0.014*"room" + 0.013*"staff" + 0.011*"guest" + 0.011*"stay" + 0.011*"check" + 0.009*"star" + 0.009*"one" + 0.008*"go"')]

In [49]:
# LDA for num_topics = 6
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=6, passes=10)
lda.print_topics()

[(0,
  '0.034*"hotel" + 0.024*"room" + 0.024*"pool" + 0.017*"stay" + 0.012*"get" + 0.011*"check" + 0.010*"service" + 0.010*"like" + 0.009*"go" + 0.009*"view"'),
 (1,
  '0.030*"hotel" + 0.018*"room" + 0.015*"check" + 0.014*"staff" + 0.012*"service" + 0.011*"guest" + 0.010*"stay" + 0.009*"one" + 0.008*"go" + 0.008*"pool"'),
 (2,
  '0.019*"check" + 0.018*"card" + 0.013*"get" + 0.012*"credit" + 0.012*"go" + 0.012*"charge" + 0.012*"room" + 0.011*"back" + 0.011*"hotel" + 0.009*"call"'),
 (3,
  '0.025*"room" + 0.023*"hotel" + 0.019*"service" + 0.014*"check" + 0.013*"stay" + 0.012*"wait" + 0.010*"get" + 0.009*"staff" + 0.009*"time" + 0.008*"take"'),
 (4,
  '0.026*"hotel" + 0.017*"stay" + 0.016*"service" + 0.013*"room" + 0.009*"night" + 0.008*"bay" + 0.008*"marina" + 0.008*"would" + 0.007*"well" + 0.007*"pool"'),
 (5,
  '0.056*"room" + 0.019*"hotel" + 0.019*"check" + 0.013*"get" + 0.011*"service" + 0.011*"tell" + 0.011*"stay" + 0.009*"would" + 0.008*"night" + 0.008*"go"')]

## Topic Modeling - Attempt #2 (Nouns Only)

In [50]:
data_nouns = pd.DataFrame(df_bad_reviews.cleaned_review.apply(nouns))
data_nouns

Unnamed: 0,cleaned_review
31,try check room reservation line check counter ...
33,twin room state floor level level book room re...
35,way money waste place pay buffet taste heavy r...
42,money service room price level night access th...
53,time stay time bay sand time home rave enemy b...
...,...
19363,guest stay day bay sand conference rate room d...
19364,stay june business trip min tell room sort com...
19368,time marina bay sand night stay comedy error p...
19369,conference hotel day conference venue place st...


In [51]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aback,abacus,ability,absence,absent,absolute,absorption,absurd,absurdity,abu,...,york,youngster,youth,yr,yuck,zee,zero,zircon,zone,zoo
31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19363,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [52]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())
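To make the two structures in the cell above concrete, here is a toy, pure-Python sketch. It does not use the notebook's data: `toy_vocabulary` and `toy_dtm` are hypothetical stand-ins for `cvn.vocabulary_` and the document-term matrix. The bag-of-words layout it prints, one list of `(term_id, count)` pairs per document, is the general shape a gensim corpus takes. The transpose in the cell above is most likely there because `Sparse2Corpus` treats each column as a document by default.

```python
# Hypothetical stand-in for cvn.vocabulary_: term -> column index
toy_vocabulary = {"pool": 0, "room": 1, "staff": 2}

# Invert it so that column index -> term, the direction id2word needs
toy_id2word = {idx: term for term, idx in toy_vocabulary.items()}

# Hypothetical document-term matrix: one row per review, one column per term
toy_dtm = [
    [2, 1, 0],  # review A: "pool" twice, "room" once
    [0, 3, 1],  # review B: "room" three times, "staff" once
]

# Bag-of-words view: keep only the non-zero counts for each review
toy_corpus = [
    [(term_id, count) for term_id, count in enumerate(row) if count > 0]
    for row in toy_dtm
]

# Translate the ids back to words to check that the mapping is consistent
for doc in toy_corpus:
    print([(toy_id2word[term_id], count) for term_id, count in doc])
```

If the ids in the corpus and the keys of the dictionary ever disagree, the topics will print the wrong words, so this round trip is a cheap sanity check.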

In [53]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.032*"hotel" + 0.028*"room" + 0.026*"staff" + 0.019*"service" + 0.014*"charge" + 0.013*"call" + 0.013*"check" + 0.013*"day" + 0.011*"time" + 0.010*"manager"'),
 (1,
  '0.061*"room" + 0.055*"hotel" + 0.026*"pool" + 0.022*"service" + 0.015*"view" + 0.014*"check" + 0.013*"stay" + 0.012*"night" + 0.012*"staff" + 0.010*"people"'),
 (2,
  '0.055*"hotel" + 0.048*"room" + 0.024*"check" + 0.023*"service" + 0.021*"pool" + 0.019*"time" + 0.015*"staff" + 0.014*"night" + 0.010*"view" + 0.010*"hour"')]
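Reading the three topics above by eye, many of the top words repeat. One possible way to quantify that is to strip the weights from the `print_topics()` strings and compare the word sets. This is only a sketch: the strings are pasted from the output above, and the helper `top_words` is a hypothetical name, not part of gensim.

```python
import re

# Topic strings copied from the 3-topic, nouns-only output above
topics = {
    0: '0.032*"hotel" + 0.028*"room" + 0.026*"staff" + 0.019*"service" + 0.014*"charge" + 0.013*"call" + 0.013*"check" + 0.013*"day" + 0.011*"time" + 0.010*"manager"',
    1: '0.061*"room" + 0.055*"hotel" + 0.026*"pool" + 0.022*"service" + 0.015*"view" + 0.014*"check" + 0.013*"stay" + 0.012*"night" + 0.012*"staff" + 0.010*"people"',
    2: '0.055*"hotel" + 0.048*"room" + 0.024*"check" + 0.023*"service" + 0.021*"pool" + 0.019*"time" + 0.015*"staff" + 0.014*"night" + 0.010*"view" + 0.010*"hour"',
}


def top_words(topic_string):
    """Return the quoted terms from one print_topics() string, in order."""
    return re.findall(r'"([^"]+)"', topic_string)


word_sets = {topic_id: set(top_words(s)) for topic_id, s in topics.items()}

# Words that appear in every topic carry little information about any one topic
shared = sorted(set.intersection(*word_sets.values()))

# Words that appear in exactly one topic are the ones that distinguish it
unique = {}
for topic_id, words in word_sets.items():
    others = set().union(*(w for t, w in word_sets.items() if t != topic_id))
    unique[topic_id] = sorted(words - others)

print("shared by all topics:", shared)
print("unique per topic:", unique)
```

For this output, half of each topic's top ten words (`check`, `hotel`, `room`, `service`, `staff`) are shared by all three topics, which suggests the topics are not yet well separated.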

In [54]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.045*"hotel" + 0.040*"room" + 0.024*"service" + 0.020*"staff" + 0.018*"check" + 0.015*"card" + 0.014*"time" + 0.014*"charge" + 0.012*"day" + 0.011*"experience"'),
 (1,
  '0.075*"room" + 0.050*"hotel" + 0.020*"check" + 0.018*"service" + 0.016*"night" + 0.015*"time" + 0.014*"book" + 0.014*"floor" + 0.013*"staff" + 0.013*"stay"'),
 (2,
  '0.038*"hotel" + 0.029*"room" + 0.026*"check" + 0.023*"service" + 0.016*"bay" + 0.016*"marina" + 0.015*"staff" + 0.014*"time" + 0.012*"experience" + 0.012*"sand"'),
 (3,
  '0.060*"hotel" + 0.044*"pool" + 0.038*"room" + 0.023*"service" + 0.020*"view" + 0.015*"people" + 0.015*"staff" + 0.013*"stay" + 0.011*"night" + 0.010*"price"')]

In [55]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.076*"room" + 0.038*"hotel" + 0.024*"check" + 0.016*"staff" + 0.016*"service" + 0.014*"pool" + 0.013*"floor" + 0.013*"night" + 0.012*"hour" + 0.012*"call"'),
 (1,
  '0.075*"hotel" + 0.040*"room" + 0.025*"service" + 0.022*"pool" + 0.017*"stay" + 0.016*"night" + 0.014*"view" + 0.013*"people" + 0.012*"check" + 0.012*"money"'),
 (2,
  '0.039*"hotel" + 0.037*"room" + 0.027*"service" + 0.027*"check" + 0.020*"staff" + 0.019*"time" + 0.013*"day" + 0.013*"night" + 0.013*"charge" + 0.013*"bay"'),
 (3,
  '0.067*"room" + 0.035*"hotel" + 0.022*"card" + 0.020*"service" + 0.014*"time" + 0.013*"staff" + 0.012*"stay" + 0.012*"pool" + 0.012*"pay" + 0.009*"work"'),
 (4,
  '0.058*"hotel" + 0.038*"pool" + 0.035*"room" + 0.023*"service" + 0.019*"staff" + 0.016*"view" + 0.015*"time" + 0.013*"experience" + 0.012*"check" + 0.011*"people"')]

In [56]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()

[(0,
  '0.056*"hotel" + 0.035*"room" + 0.031*"service" + 0.024*"staff" + 0.021*"stay" + 0.020*"pool" + 0.016*"view" + 0.015*"bay" + 0.013*"marina" + 0.012*"time"'),
 (1,
  '0.043*"room" + 0.026*"hotel" + 0.023*"pool" + 0.015*"time" + 0.012*"service" + 0.009*"staff" + 0.009*"view" + 0.009*"check" + 0.008*"book" + 0.008*"look"'),
 (2,
  '0.066*"hotel" + 0.051*"room" + 0.027*"pool" + 0.022*"service" + 0.018*"check" + 0.014*"time" + 0.013*"staff" + 0.012*"night" + 0.010*"people" + 0.010*"stay"'),
 (3,
  '0.036*"staff" + 0.019*"bar" + 0.016*"customer" + 0.015*"service" + 0.009*"car" + 0.009*"call" + 0.008*"day" + 0.008*"experience" + 0.008*"park" + 0.008*"check"'),
 (4,
  '0.080*"room" + 0.031*"hotel" + 0.029*"check" + 0.019*"service" + 0.017*"staff" + 0.015*"night" + 0.013*"book" + 0.013*"day" + 0.012*"hour" + 0.012*"time"'),
 (5,
  '0.021*"hotel" + 0.014*"night" + 0.013*"room" + 0.012*"service" + 0.011*"manager" + 0.010*"book" + 0.009*"check" + 0.008*"way" + 0.007*"day" + 0.007*"pool"')]

## Topic Modeling - Attempt #3 (Nouns and Adjectives)

In [57]:
data_nouns_adj = pd.DataFrame(df_bad_reviews.cleaned_review.apply(nouns_adj))
data_nouns_adj

Unnamed: 0,cleaned_review
31,arrive try check room reservation line check c...
33,deluxe twin room state low floor level level b...
35,way expensive money waste place pay buffet tas...
42,worth money iconic service outstanding room ni...
53,second time stay second time stay bay sand tim...
...,...
19363,ready welcome guest stay day bay sand conferen...
19364,stay june business trip min tell room ready so...
19368,ready prime time marina bay sand foreseeable f...
19369,bad conference hotel day conference bad venue ...


In [58]:
# Create a new document-term matrix using only nouns and adjectives, and remove overly common words with max_df.
# max_df drops terms that appear too frequently, also known as "corpus-specific stop words".
# For example, max_df=.8 ignores terms that appear in more than 80% of the documents.
cvna = CountVectorizer(max_df=.8)
data_cvna = cvna.fit_transform(data_nouns_adj.cleaned_review)
data_dtmna = pd.DataFrame(data_cvna.toarray(), columns=cvna.get_feature_names_out())
data_dtmna.index = data_nouns_adj.index
data_dtmna

Unnamed: 0,aback,abacus,abide,ability,able,abrasive,abrupt,absence,absent,absolute,...,youngster,youth,yr,yuck,zealous,zee,zero,zircon,zone,zoo
31,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
42,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
53,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19363,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19364,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
19369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
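As a rough illustration of what `max_df=.8` does, the sketch below recomputes the filter by hand on a hypothetical five-review mini-corpus (`toy_reviews` is made up for this example). It counts how many reviews each term appears in and drops any term whose document frequency exceeds 80% of the reviews. This mirrors the documented behaviour of the parameter, not scikit-learn's actual implementation.

```python
from collections import Counter

# Hypothetical mini-corpus of already-cleaned reviews
toy_reviews = [
    "room pool view",
    "room staff rude",
    "room check wait",
    "room pool crowd",
    "room staff check",
]

max_df = 0.8  # same threshold as the cell above

# Document frequency: the number of reviews in which each term appears at least once
doc_freq = Counter(term for review in toy_reviews for term in set(review.split()))

# A term is kept only if its document frequency does not exceed max_df * n_documents
limit = max_df * len(toy_reviews)
kept = sorted(term for term, df in doc_freq.items() if df <= limit)
dropped = sorted(term for term, df in doc_freq.items() if df > limit)

print("dropped:", dropped)
print("kept:", kept)
```

Here `room` appears in all five reviews (100%) and is dropped, while `pool` and `staff` (40% each) survive. In the topic outputs that follow, "hotel" no longer appears among the top words, which is consistent with it exceeding the 80% threshold in the real corpus.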


In [59]:
# Create the gensim corpus
corpusna = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmna.transpose()))

# Create the vocabulary dictionary
id2wordna = dict((v, k) for k, v in cvna.vocabulary_.items())

In [60]:
# Let's start with 3 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=3, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.054*"room" + 0.015*"service" + 0.014*"check" + 0.013*"staff" + 0.011*"stay" + 0.010*"night" + 0.010*"time" + 0.010*"call" + 0.009*"day" + 0.008*"pool"'),
 (1,
  '0.032*"room" + 0.020*"pool" + 0.017*"service" + 0.014*"check" + 0.014*"stay" + 0.010*"view" + 0.010*"staff" + 0.010*"time" + 0.009*"people" + 0.009*"good"'),
 (2,
  '0.022*"pool" + 0.019*"room" + 0.018*"service" + 0.012*"staff" + 0.011*"star" + 0.010*"experience" + 0.008*"guest" + 0.008*"bad" + 0.008*"view" + 0.007*"night"')]

In [61]:
# Let's try 4 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=4, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.030*"room" + 0.011*"service" + 0.010*"pool" + 0.009*"staff" + 0.009*"check" + 0.009*"stay" + 0.007*"time" + 0.007*"guest" + 0.007*"club" + 0.006*"pay"'),
 (1,
  '0.045*"room" + 0.020*"service" + 0.017*"check" + 0.014*"staff" + 0.012*"time" + 0.011*"pool" + 0.011*"stay" + 0.009*"bad" + 0.009*"guest" + 0.009*"night"'),
 (2,
  '0.021*"room" + 0.012*"staff" + 0.009*"pool" + 0.008*"good" + 0.008*"service" + 0.008*"people" + 0.008*"experience" + 0.006*"bed" + 0.006*"check" + 0.006*"guest"'),
 (3,
  '0.031*"room" + 0.030*"pool" + 0.015*"stay" + 0.014*"view" + 0.014*"service" + 0.011*"night" + 0.010*"people" + 0.009*"good" + 0.009*"check" + 0.008*"staff"')]

In [62]:
# Let's try 5 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=5, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.007*"key" + 0.006*"time" + 0.006*"guest" + 0.006*"service" + 0.006*"room" + 0.005*"card" + 0.005*"book" + 0.005*"casino" + 0.005*"bad" + 0.004*"nothing"'),
 (1,
  '0.023*"service" + 0.023*"room" + 0.018*"staff" + 0.015*"stay" + 0.015*"pool" + 0.012*"card" + 0.009*"guest" + 0.008*"star" + 0.008*"day" + 0.007*"great"'),
 (2,
  '0.036*"room" + 0.035*"pool" + 0.015*"check" + 0.014*"service" + 0.013*"view" + 0.013*"stay" + 0.011*"people" + 0.010*"night" + 0.009*"good" + 0.009*"time"'),
 (3,
  '0.023*"room" + 0.018*"staff" + 0.015*"guest" + 0.015*"service" + 0.010*"stay" + 0.010*"time" + 0.010*"check" + 0.009*"night" + 0.009*"experience" + 0.009*"star"'),
 (4,
  '0.054*"room" + 0.019*"service" + 0.016*"check" + 0.012*"staff" + 0.012*"time" + 0.011*"call" + 0.010*"stay" + 0.009*"book" + 0.008*"hour" + 0.008*"night"')]

In [63]:
# Let's try 6 topics
ldana = models.LdaModel(corpus=corpusna, num_topics=6, id2word=id2wordna, passes=10)
ldana.print_topics()

[(0,
  '0.069*"room" + 0.021*"check" + 0.015*"service" + 0.012*"time" + 0.011*"hour" + 0.010*"call" + 0.010*"book" + 0.009*"view" + 0.009*"night" + 0.008*"staff"'),
 (1,
  '0.032*"room" + 0.027*"pool" + 0.015*"card" + 0.013*"check" + 0.012*"view" + 0.011*"great" + 0.011*"service" + 0.011*"stay" + 0.009*"time" + 0.009*"club"'),
 (2,
  '0.029*"room" + 0.026*"service" + 0.020*"staff" + 0.012*"stay" + 0.011*"pool" + 0.010*"check" + 0.010*"time" + 0.008*"night" + 0.008*"day" + 0.008*"guest"'),
 (3,
  '0.033*"pool" + 0.027*"room" + 0.018*"stay" + 0.011*"view" + 0.011*"service" + 0.010*"bad" + 0.010*"place" + 0.009*"experience" + 0.009*"night" + 0.009*"star"'),
 (4,
  '0.020*"room" + 0.015*"pool" + 0.009*"service" + 0.009*"time" + 0.009*"staff" + 0.009*"view" + 0.008*"stay" + 0.008*"people" + 0.008*"good" + 0.008*"guest"'),
 (5,
  '0.029*"pool" + 0.016*"room" + 0.015*"check" + 0.015*"stay" + 0.013*"service" + 0.012*"people" + 0.011*"night" + 0.010*"good" + 0.009*"staff" + 0.008*"line"')]

## Identify Topics in Best Models