
# **Topic Modeling**
## **Introduction**






Another popular text analysis technique is topic modeling. The goal of topic modeling is to discover the topics that are present in your corpus. Each document in the corpus is treated as a mixture of one or more topics.

In this notebook, we walk through the steps of **Latent Dirichlet Allocation (LDA)**, one of many topic modeling techniques and one that was designed specifically for text data.

To use a topic modeling technique, you need to provide:

> (1) a document-term matrix, and

> (2) the number of topics you would like the algorithm to pick up.
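To make the first input concrete, here is a minimal pure-Python sketch of a document-term matrix: one row per document, one column per vocabulary term, and each cell holding a word count. The three toy documents and the helper name `build_dtm` are illustrative assumptions and are not part of the hotel review data; the notebook itself builds the matrix with `CountVectorizer`.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives every term a stable column position
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical toy corpus, for illustration only
toy_docs = ["great pool great view", "friendly staff", "pool staff"]
vocab, dtm = build_dtm(toy_docs)
print(vocab)
for row in dtm:
    print(row)
```

Most cells are zero even in this tiny example, which is why real document-term matrices are usually stored in a sparse format.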






Once the topic modeling technique is applied, your job as a human is to interpret the results and judge whether the mix of words in each topic makes sense. If it doesn't, you can try changing the number of topics, the terms in the document-term matrix, or the model parameters, or even try a different model.


# NOTE: LDA takes a VERY LONG time to run

**Importing the data**

In [1]:
import pandas as pd
import numpy as np
import pyprojroot.here as here

# Data Cleaning

In [2]:
data1 = pd.read_csv(here("data/processed/cleaned_swissotel-the-stamford.csv"))
data2 = pd.read_csv(here("data/processed/cleaned_mbs_total.csv"))
data3 = pd.read_csv(here("data/processed/cleaned_pan-pacific.csv"))
data4 = pd.read_csv(here("data/processed/cleaned_parkroyal-collection-marina-bay.csv"))
data5 = pd.read_csv(here("data/processed/cleaned_fullerton.csv"))
df = pd.concat([data1, data2, data3, data4, data5], ignore_index = True)
print(len(data1))
print(len(data2))
print(len(data3))
print(len(data4))
print(len(data5))
len(df)

5058
10523
7430
6237
6374


35622

In [3]:
# Preview the first three rows of the combined dataset
df.head(3)

Unnamed: 0.1,Unnamed: 0,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating,valid_rating,label,cleaned_review,combined_review,date,covid,year,stem_review,lem_review
0,0,Ernest L,Excellent stay with fantastic scenery,This is our 4th stay. I still remember that I ...,Trip type: Travelled with family,,1.0,,5.0,True,Positive,excellent stay fantastic scenery th stay still...,Excellent stay with fantastic scenery This is ...,2023-09-01,PostCovid,2023,excel stay fantast sceneri th stay still remem...,excellent stay fantastic scenery th stay still...
1,1,lovemylife999,Pampercation,64th floor Crest Suite Harbour view. Spacious ...,Trip type: Travelled as a couple,"Singapore, Singapore",91.0,2.0,,False,,pampercation th floor crest suite harbour view...,Pampercation 64th floor Crest Suite Harbour vi...,2023-05-01,PostCovid,2023,pamperc th floor crest suit harbour view spaci...,pampercation th floor crest suite harbour view...
2,2,JOSE JOAQUIN ORTIZ GARCIA,Excellent location and facilities as a confere...,"The hotel is located near the marina, so going...",Trip type: Travelled on business,"Chia, Colombia",32.0,,5.0,True,Positive,excellent location facilities conference venue...,Excellent location and facilities as a confere...,2023-09-01,PostCovid,2023,excel locat facil confer venu hotel locat near...,excellent location facility conference venue h...


In [4]:
df[df.traveller_username == 'Jayne Jeong'].review_text.values

array([' As a remarkable landmark of Singapore, I recommend this hotel. As you know, it is known for its great infinity pool and definitely it is wonderful. And also the staffs are kind and professional. There are several Korean staffs and they are ready to help you. That is why I tried this hotel twice - first was 2016 and this time. A room condition was okay and met my expectation.   * 수영장 하나만으로도 값어치 한다고 생각합니다. 인피니티 풀에서 보는 뷰는 낮/밤 모두 환상적이니 꼭 챙겨보세요 :) '],
      dtype=object)

In [5]:
df.isnull().sum()

Unnamed: 0                                   0
traveller_username                           0
review_title                                32
review_text                                  0
travel_type                              17656
traveller_country_origin                  9043
traveller_total_contributions              138
traveller_total_helpful_contributions     6030
rating                                    4937
valid_rating                                 0
label                                     4937
cleaned_review                               0
combined_review                              0
date                                         0
covid                                        0
year                                         0
stem_review                                  0
lem_review                                   0
dtype: int64

### Create document-term matrix


In [6]:
df.shape

(35622, 18)

# Topic Modeling with Good reviews (rating >= 4)

In [7]:
df_good_reviews = df.loc[df.rating >= 4]
df_good_reviews.shape

(26961, 18)
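One consequence of filtering on `rating` is worth spelling out: a comparison against a missing value is always false, so reviews with no rating (4937 according to the null counts above) end up in neither the good subset nor a `rating <= 2` subset. The sketch below shows the same behaviour with plain Python floats; the toy ratings are made up for illustration.

```python
import math

# Hypothetical ratings; math.nan stands in for a review with no rating
ratings = [5.0, math.nan, 4.0, 2.0, 1.0, math.nan, 3.0]

good = [r for r in ratings if r >= 4]   # NaN >= 4 evaluates to False
bad = [r for r in ratings if r <= 2]    # NaN <= 2 evaluates to False
unrated = [r for r in ratings if math.isnan(r)]

print(len(good), len(bad), len(unrated))
```

Reviews rated exactly 3 are also left out of both subsets, so the two groups together do not add up to the full dataset.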

In [8]:
# Create document-term matrix
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
data_cv = cv.fit_transform(df_good_reviews.cleaned_review)

In [9]:
data_cv.shape

(26961, 26109)

In [10]:
df_good_reviews.head()

Unnamed: 0.1,Unnamed: 0,traveller_username,review_title,review_text,travel_type,traveller_country_origin,traveller_total_contributions,traveller_total_helpful_contributions,rating,valid_rating,label,cleaned_review,combined_review,date,covid,year,stem_review,lem_review
0,0,Ernest L,Excellent stay with fantastic scenery,This is our 4th stay. I still remember that I ...,Trip type: Travelled with family,,1.0,,5.0,True,Positive,excellent stay fantastic scenery th stay still...,Excellent stay with fantastic scenery This is ...,2023-09-01,PostCovid,2023,excel stay fantast sceneri th stay still remem...,excellent stay fantastic scenery th stay still...
2,2,JOSE JOAQUIN ORTIZ GARCIA,Excellent location and facilities as a confere...,"The hotel is located near the marina, so going...",Trip type: Travelled on business,"Chia, Colombia",32.0,,5.0,True,Positive,excellent location facilities conference venue...,Excellent location and facilities as a confere...,2023-09-01,PostCovid,2023,excel locat facil confer venu hotel locat near...,excellent location facility conference venue h...
3,3,Changboo C,Great experience while staying the Swissotel,Was the great staying. I have watched the whol...,Trip type: Travelled on business,,1.0,,5.0,True,Positive,great experience staying swissotel great stayi...,Great experience while staying the Swissotel W...,2023-09-01,PostCovid,2023,great experi stay swissotel great stay watch w...,great experience stay swissotel great staying ...
5,5,Shu Mun L,Great stay,"Great location. Very convenient for food, shop...",Trip type: Travelled on business,"Kuala Lumpur, Malaysia",8.0,1.0,5.0,True,Positive,great stay great location convenient food shop...,Great stay Great location. Very convenient for...,2023-08-01,PostCovid,2023,great stay great locat conveni food shop walk ...,great stay great location convenient food shop...
6,6,Cruiser62845265595,A great place to stay,The rooms are quiet and clean with easy access...,Trip type: Travelled with family,,1.0,,5.0,True,Positive,great place stay rooms quiet clean easy access...,A great place to stay The rooms are quiet and ...,2023-09-01,PostCovid,2023,great place stay room quiet clean easi access ...,great place stay room quiet clean easy access ...


## Topic Modeling - Attempt #1 (Nouns Only)

In [11]:
from nltk.corpus import stopwords
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import string
from nltk import word_tokenize, pos_tag

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/ammarbagharib/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
from gensim import matutils, models
import scipy.sparse

In [12]:
# Let's create a function to pull out nouns from a string of text
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    # Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
    tokenized = word_tokenize(text)
    all_nouns = [word for (word, pos) in pos_tag(tokenized) if pos.startswith('NN')]
    return ' '.join(all_nouns)
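
The filter keeps any token whose Penn Treebank tag starts with `NN`, which covers `NN`, `NNS`, `NNP` and `NNPS`. Below is a small sketch of just that filtering step, applied to hand-tagged tokens so that it runs without downloading any NLTK data. The helper name `keep_nouns` and the tagged pairs are illustrative assumptions, not actual `pos_tag` output.

```python
def keep_nouns(tagged_tokens):
    """Join the words whose POS tag starts with 'NN'."""
    return ' '.join(word for word, pos in tagged_tokens if pos[:2] == 'NN')

# Hand-written (word, tag) pairs, for illustration only
tagged = [("excellent", "JJ"), ("stay", "NN"), ("rooms", "NNS"),
          ("singapore", "NNP"), ("walked", "VBD")]
print(keep_nouns(tagged))
```

Adjectives and verbs are dropped, which removes sentiment words such as "excellent" and leaves the things the reviews are about.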

In [13]:
data_nouns = pd.DataFrame(df_good_reviews.cleaned_review.apply(nouns))
data_nouns

Unnamed: 0,cleaned_review
0,stay scenery th stay wife hotel hotel times wa...
2,location facilities conference hotel marina mo...
3,experience swissotel f singapore view location...
5,stay location convenient food shopping tourist...
6,place stay rooms access mrt facilities staff f...
...,...
35617,fullerton hotels hotel star chain surpass room...
35618,stay stay location service breakfast room faci...
35619,elegance comfort fullerton hotel position sing...
35620,sunday brunch singapore afternoon head brunch ...


In [14]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

Unnamed: 0,aaaaaa,aahhhh,aaleyah,aaron,aawesome,abalone,abandon,abby,abd,abdi,...,zoo,zool,zoom,zoos,zul,zulfadhli,zulfadli,zura,zurich,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35617,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35618,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35619,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35620,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
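# Create the gensim corpus (Sparse2Corpus expects terms as rows, hence the transpose)
# Requires the `from gensim import matutils, models` and `import scipy.sparse` cell above to have been run first
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary (column index -> term)
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())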
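
For readers unfamiliar with the gensim format: a corpus is consumed one document at a time, each document being a list of `(term_id, count)` pairs, and `id2word` maps those ids back to terms. Here is a pure-Python sketch of both structures, using a made-up vocabulary and count matrix rather than the review data:

```python
# Hypothetical vocabulary in the shape of CountVectorizer.vocabulary_ (term -> column index)
vocabulary = {"pool": 0, "staff": 1, "view": 2}

# Hypothetical document-term counts: one row per document
dtm = [[2, 0, 1],
       [0, 3, 0]]

# Invert the mapping, as the notebook does, to get column index -> term
id2word = {idx: term for term, idx in vocabulary.items()}

# Bag-of-words view: keep only the non-zero (term_id, count) pairs per document
bow_corpus = [[(j, c) for j, c in enumerate(row) if c > 0] for row in dtm]

for doc in bow_corpus:
    print([(id2word[j], c) for j, c in doc])
```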

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()
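
When comparing runs with different numbers of topics, it can help to reduce each topic to its word list. `print_topics()` returns `(topic_id, string)` pairs in which each string looks like `0.050*"room" + 0.030*"staff"`. The sketch below pulls the quoted words out of such strings; the helper name `topic_words` is hypothetical and the sample topics are made up to show the format, not real model results.

```python
import re

def topic_words(topic_string):
    """Extract the quoted words from a string such as '0.050*"room" + 0.030*"staff"'."""
    return re.findall(r'"([^"]+)"', topic_string)

# Made-up output in the format print_topics() uses; not real model results
sample_topics = [
    (0, '0.050*"room" + 0.030*"staff" + 0.020*"breakfast"'),
    (1, '0.060*"pool" + 0.040*"view" + 0.010*"marina"'),
]

for topic_id, topic_string in sample_topics:
    print(topic_id, topic_words(topic_string))
```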

# Topic Modeling with Bad reviews (rating <= 2)

## Topic Modeling - Attempt #2 (Nouns Only)

In [None]:
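# Select the bad reviews (rating <= 2), then keep only the nouns
df_bad_reviews = df.loc[df.rating <= 2]
data_nouns = pd.DataFrame(df_bad_reviews.cleaned_review.apply(nouns))
data_nouns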

In [None]:
# Recreate a document-term matrix with only nouns
cvn = CountVectorizer()
data_cvn = cvn.fit_transform(data_nouns.cleaned_review)
data_dtmn = pd.DataFrame(data_cvn.toarray(), columns=cvn.get_feature_names_out())
data_dtmn.index = data_nouns.index
data_dtmn

In [None]:
# Create the gensim corpus
corpusn = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(data_dtmn.transpose()))

# Create the vocabulary dictionary
id2wordn = dict((v, k) for k, v in cvn.vocabulary_.items())

In [None]:
# Let's try topics = 3
ldan = models.LdaModel(corpus=corpusn, num_topics=3, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 4
ldan = models.LdaModel(corpus=corpusn, num_topics=4, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 5
ldan = models.LdaModel(corpus=corpusn, num_topics=5, id2word=id2wordn, passes=10)
ldan.print_topics()

In [None]:
# Let's try topics = 6
ldan = models.LdaModel(corpus=corpusn, num_topics=6, id2word=id2wordn, passes=10)
ldan.print_topics()