# Topic Modeling

Testing topic modeling on a TripAdvisor hotel reviews dataset.

1. Load the dataset and preprocess the reviews
2. Perform Topic Modeling using two different libraries:
    1. sklearn LDA: Tune the number of topics, learning decay and batch size values
    2. gensim LDA: Tune the number of topics, chunk size and passes values

Parameters explanation:
- **num_topics**: The number of topics to extract from the corpus (`n_components` in sklearn, `num_topics` in gensim).
- **learning_decay** (sklearn): Controls how quickly the learning rate of the online update decreases over time.
- **batch_size** (sklearn): The number of documents used in each EM step of the online update.
- **chunksize** (gensim): The number of documents used in each training chunk.
- **passes** (gensim): The number of passes through the corpus during training.
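
To make the effect of `learning_decay` more concrete: in online variational LDA the weight given to each mini-batch update is commonly written as `(offset + step) ** -decay`. The sketch below only evaluates that schedule. The function name `learning_rate` and the offset value `10.0` are illustrative assumptions, not values taken from this notebook.

```python
def learning_rate(step: int, learning_decay: float, learning_offset: float = 10.0) -> float:
    """Weight of the mini-batch update at a given step: (offset + step) ** -decay."""
    return (learning_offset + step) ** (-learning_decay)


# Compare the three decay values that are tuned later in this notebook.
for decay in (0.5, 0.7, 0.9):
    rates = [round(learning_rate(step, decay), 4) for step in (0, 10, 100)]
    print(f"learning_decay={decay}: {rates}")
```

A larger decay shrinks the weight of later batches faster, so early batches dominate the fitted topics more strongly.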

In [46]:
from typing import List
import os

import pandas as pd
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
import nltk
from nltk import pos_tag, word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gianl\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

If you have incompatibility problems between gensim and scipy:
- uninstall current version of scipy
- run `pip install scipy==1.10.1`

## 1. Load the dataset and preprocess the reviews

In [47]:
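lemmatizer = WordNetLemmatizer()
# build the stopword set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# note: word_tokenize, WordNetLemmatizer and pos_tag also need their NLTK data packages
def preprocess(review: str) -> List[str]:
    """Tokenize a review and return its lowercased, lemmatized nouns."""
    tokens = word_tokenize(review.lower())
    # keep alphabetic tokens only (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]
    # remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    # keep only nouns (POS tags starting with 'N')
    tokens = [word for word, pos in pos_tag(tokens) if pos.startswith('N')]
    return tokens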

If the preprocessed reviews file does not exist, preprocess the reviews and save them to a file; otherwise, load the preprocessed reviews from that file.

In [48]:
if 'reviews_preprocessed.txt' not in os.listdir('resources'):
    # preprocess the reviews
    df = pd.read_csv('resources/reviews.csv', nrows=3000)
    reviews = df['Review']
    reviews = [review.strip() for review in reviews] # strip leading/trailing whitespace (including newlines) from each review
    # remove punctuation, stopwords, lemmatize and keep only nouns
    reviews = [preprocess(review) for review in reviews]
    # save the preprocessed reviews to a file
    with open('resources/reviews_preprocessed.txt', 'w') as f:
        for review in reviews:
            f.write(','.join(review) + '\n')
else:
    # load the preprocessed reviews
    reviews = []
    with open('resources/reviews_preprocessed.txt', 'r') as f:
        for line in f:
            reviews.append(line.strip().split(','))
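
The file format used above is one review per line, with tokens joined by commas. The following round-trip sketch shows that format on a temporary file. The helper names `save_reviews` and `load_reviews` are hypothetical, and the empty-review case is an edge case added for illustration: `''.split(',')` returns `['']`, so the loader filters out empty strings.

```python
import os
import tempfile
from typing import List


def save_reviews(path: str, reviews: List[List[str]]) -> None:
    """Write one review per line, tokens joined by commas."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in reviews:
            f.write(",".join(tokens) + "\n")


def load_reviews(path: str) -> List[List[str]]:
    """Read the file back; an empty line becomes an empty token list."""
    with open(path, "r", encoding="utf-8") as f:
        # "".split(",") would give [""], so drop empty strings
        return [[tok for tok in line.strip().split(",") if tok] for line in f]


sample = [["hotel", "parking", "deal"], [], ["room", "view"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reviews_preprocessed.txt")
    save_reviews(path, sample)
    restored = load_reviews(path)

print(restored == sample)
```

The comma separator is safe here because preprocessing keeps alphabetic tokens only.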

Print the first review after preprocessing

In [49]:
reviews[0]

['hotel',
 'parking',
 'deal',
 'hotel',
 'evening',
 'review',
 'valet',
 'check',
 'view',
 'room',
 'room',
 'size',
 'woke',
 'pillow',
 'soundproof',
 'heard',
 'music',
 'room',
 'night',
 'morning',
 'loud',
 'bang',
 'door',
 'closing',
 'people',
 'neighbor',
 'bath',
 'product',
 'stay',
 'advantage',
 'location',
 'distance',
 'experience',
 'pay',
 'parking',
 'night']

## 2. Topic Modeling

### 2.1. sklearn LDA

In [50]:
def sklearn_lda_evaluate_models(reviews: List[List[str]], search_params: dict):
    """
    Evaluate sklearn's LDA implementation for different parameter settings.
    :param reviews: list of preprocessed reviews
    :param search_params: dictionary containing the parameters to be tuned and their possible values
    :return: the fitted GridSearchCV object (the best model is available as `best_estimator_`)
    """
    reviews = [' '.join(review) for review in reviews]
    
    # convert the reviews to a document-term matrix (one row per review, one column per word)
    vectorizer = CountVectorizer()
    data_vectorized = vectorizer.fit_transform(reviews)
    
    lda = LatentDirichletAllocation(learning_method='online')
    
    # initiate GridSearchCV
    model = GridSearchCV(lda, param_grid=search_params)
    
    # fit the GridSearchCV model
    model.fit(data_vectorized)
    
    return model

### 2.2. gensim LDA

#### Create a dictionary and a corpus from the preprocessed reviews as required by gensim's LDA model

Dictionary:
- The dictionary encapsulates the mapping between **normalized words** (nouns in this case) and their **integer ids**.
- Each unique word is assigned a unique id.

Corpus:
- The corpus is a list of documents where each document is represented as a list of tuples.
- Each tuple consists of a word's integer id and its frequency in the document.
- The dictionary's `doc2bow` method converts each document (a list of words) into this bag-of-words format.
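
To make the bag-of-words format concrete, here is a plain-Python stand-in for the dictionary and corpus described above. It is not gensim's implementation: the helper names `build_token_ids` and `to_bow` are hypothetical, and the id assignment order (alphabetical among the new words of each document) is an assumption chosen to resemble the dictionary sample printed in this notebook.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_token_ids(documents: List[List[str]]) -> Dict[str, int]:
    """Assign each unique token the next free integer id."""
    token2id: Dict[str, int] = {}
    for doc in documents:
        # new tokens of a document receive ids in alphabetical order
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


docs = [["room", "hotel", "room"], ["hotel", "pool"]]
token2id = build_token_ids(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

Word order within a document is lost in this representation; only the counts per word id remain.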

In [51]:
dictionary = corpora.Dictionary(reviews)

corpus = [dictionary.doc2bow(text) for text in reviews]

Print dictionary and corpus samples

In [79]:
# print dictionary sample
print("Dictionary Sample:")
for i, (word_id, word) in enumerate(dictionary.iteritems()):
    print(f"ID {word_id}: {word}")
    if i == 4:
        break
        
print("\n")

# print the BoW representation of the first document in the corpus (first 10 entries)
print("Corpus Sample:")
# replace each word id with the word itself and keep its count
formatted_doc = [(dictionary[word_id], count) for word_id, count in corpus[0]]
print(formatted_doc[:10])

Dictionary Sample:
ID 0: advantage
ID 1: bang
ID 2: bath
ID 3: check
ID 4: closing


Corpus Sample:
[('advantage', 1), ('bang', 1), ('bath', 1), ('check', 1), ('closing', 1), ('deal', 1), ('distance', 1), ('door', 1), ('evening', 1), ('experience', 1)]


#### Evaluate the LDA model using gensim's implementation for different numbers of topics, chunksize and passes values

In [53]:
def gensim_lda_evaluate_models(corpus: List[List[tuple]],  # bag-of-words corpus: (word_id, count) pairs
                    dictionary: corpora.Dictionary,
                    texts: List[List[str]],
                    topic_numbers: List[int],
                    chunksize_values: List[int],
                    passes_values: List[int],):
    results = []
    for num_topics in topic_numbers:
        for chucksize_value in chunksize_values:
            for passes_value in passes_values:
    
                print("Evaluating model with:")
                print(f"num_topics={num_topics}, chucksize={chucksize_value}, passes={passes_value}")
                
                model = gensim.models.ldamodel.LdaModel(
                    corpus=corpus,
                    id2word=dictionary,
                    num_topics=num_topics,
                    random_state=100,
                    chunksize=chucksize_value,
                    passes=passes_value,
                    alpha="auto",
                    eta="auto",
                    per_word_topics=True
                )
                coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
                coherence_score = coherencemodel.get_coherence()
                results.append((num_topics, chucksize_value, passes_value, coherence_score))
    return results
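
The three nested loops above enumerate the Cartesian product of the candidate values, and one model is trained per combination. As a sketch of the same enumeration in a single flat loop, `itertools.product` yields the combinations in the same order. The value lists below are the ones used later in this notebook.

```python
from itertools import product

topic_numbers = list(range(2, 16, 2))
chunksize_values = [100, 200]
passes_values = [5, 10, 20]

# product() varies the rightmost argument fastest, like the innermost loop
grid = list(product(topic_numbers, chunksize_values, passes_values))

print(len(grid))  # number of models that will be trained
print(grid[:4])   # first few (num_topics, chunksize, passes) combinations
```

Knowing the size of the grid in advance helps to estimate the total training time before starting the evaluation.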

### Run the model evaluation for sklearn LDA

In [57]:
sklearn_search_params = {'n_components': list(range(2, 16, 2)), 'learning_decay': [0.5, 0.7, 0.9], 'batch_size': [100, 200]}

sklearn_results = sklearn_lda_evaluate_models(reviews, sklearn_search_params)

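Print the first 5 parameter combinations stored in `cv_results_` with their mean log likelihood score. The entries are in parameter-grid order, not sorted by score.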

In [76]:
temp_reviews = [' '.join(review) for review in reviews]

# convert the reviews to a document-term matrix (one row per review, one column per word)
vectorizer = CountVectorizer()
data_vectorized = vectorizer.fit_transform(temp_reviews)

# print the first 5 entries of cv_results_ (grid order, not sorted by score)
for i in range(5):
    print(f"Model Rank: {i+1}")
    print(f"Model's Params: {sklearn_results.cv_results_['params'][i]}")
    print(f"Model's Log Likelihood Score: {sklearn_results.cv_results_['mean_test_score'][i]}")
    # the perplexity is computed on the best estimator, so it is the same for every entry
    print(f"Model's Perplexity Score: {sklearn_results.best_estimator_.perplexity(data_vectorized)}")
    print("\n")

Model Rank: 1
Model's Params: {'batch_size': 100, 'learning_decay': 0.5, 'n_components': 2}
Model's Log Likelihood Score: -215686.93851937214
Model's Perplexity Score: 865.0991630931732


Model Rank: 2
Model's Params: {'batch_size': 100, 'learning_decay': 0.5, 'n_components': 4}
Model's Log Likelihood Score: -222164.97607662837
Model's Perplexity Score: 865.0991630931732


Model Rank: 3
Model's Params: {'batch_size': 100, 'learning_decay': 0.5, 'n_components': 6}
Model's Log Likelihood Score: -228551.10209977944
Model's Perplexity Score: 865.0991630931732


Model Rank: 4
Model's Params: {'batch_size': 100, 'learning_decay': 0.5, 'n_components': 8}
Model's Log Likelihood Score: -234344.3129196315
Model's Perplexity Score: 865.0991630931732


Model Rank: 5
Model's Params: {'batch_size': 100, 'learning_decay': 0.5, 'n_components': 10}
Model's Log Likelihood Score: -240262.0742506627
Model's Perplexity Score: 865.0991630931732


Best Model's Params: {'batch_size': 200, 'learning_decay': 0.

You can see the results in the file `sklearn_res_1.txt` in the results folder.

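A higher log likelihood score and a lower perplexity score indicate a better model.
Among the first five parameter combinations listed above, the one with 2 topics, a learning decay of 0.5 and a batch size of 100 has the highest mean log likelihood. The perplexity is computed on the best estimator each time, which is why it is identical in every entry. The best parameters over the whole grid are those reported in the truncated `Best Model's Params` line (batch size of 200).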
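
`GridSearchCV` stores its results in parameter-grid order, so a ranked listing needs an explicit sort. A minimal sketch follows, assuming a `cv_results_`-like mapping with the `params`, `mean_test_score` and `rank_test_score` keys. The helper name `top_k` is hypothetical, and the toy rows reuse three scores from the output above (rounded), with ranks assigned among those three rows only.

```python
from typing import Any, Dict, List, Tuple


def top_k(cv_results: Dict[str, Any], k: int = 5) -> List[Tuple[int, dict, float]]:
    """Return (rank, params, mean_test_score) for the k best-ranked rows."""
    rows = zip(cv_results["rank_test_score"],
               cv_results["params"],
               cv_results["mean_test_score"])
    # rank 1 is the best; sort on the rank only
    ranked = sorted(rows, key=lambda row: row[0])
    return [(int(rank), params, float(score)) for rank, params, score in ranked[:k]]


# Toy stand-in for sklearn_results.cv_results_ (ranks cover these rows only)
toy = {
    "params": [{"n_components": 6}, {"n_components": 2}, {"n_components": 4}],
    "mean_test_score": [-228551.10, -215686.94, -222164.98],
    "rank_test_score": [3, 1, 2],
}
for rank, params, score in top_k(toy, k=2):
    print(rank, params, score)
```

With the fitted object from this notebook, the call would be `top_k(sklearn_results.cv_results_)`.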

Note: perplexity might not be the best measure to evaluate topic models because it doesn't consider the context and semantic associations between words. A better measure is the coherence score, which is used below to evaluate the gensim LDA model.
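
For reference, perplexity is a transformation of the log likelihood: the exponential of the negative log likelihood per word. The sketch below illustrates that relationship on a toy case; the function name `perplexity` and the toy numbers are illustrative assumptions, not values from this notebook.

```python
import math


def perplexity(log_likelihood: float, n_words: int) -> float:
    """exp(-log likelihood per word); lower is better."""
    return math.exp(-log_likelihood / n_words)


# Toy check: if a model gives each of 1000 words probability 1/8,
# the log likelihood is 1000 * log(1/8) and the perplexity is about 8.
n_words = 1000
log_likelihood = n_words * math.log(1 / 8)
print(perplexity(log_likelihood, n_words))
```

Because both numbers come from the same likelihood, perplexity and log likelihood rank models fitted on the same data in the same way; neither looks at whether the top words of a topic belong together.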

### Run the model evaluation for gensim LDA

In [55]:
topic_numbers = list(range(2, 16, 2))
chunksize_values = [100, 200]
passes_values = [5, 10, 20]

# evaluate every parameter combination and collect the coherence scores to choose the best model
results = gensim_lda_evaluate_models(corpus=corpus,
                           dictionary=dictionary,
                           texts=reviews,
                           topic_numbers=topic_numbers,
                           chunksize_values=chunksize_values,
                           passes_values=passes_values)

Evaluating model with:
num_topics=2, chucksize=100, passes=5
Evaluating model with:
num_topics=2, chucksize=100, passes=10
Evaluating model with:
num_topics=2, chucksize=100, passes=20
Evaluating model with:
num_topics=2, chucksize=200, passes=5
Evaluating model with:
num_topics=2, chucksize=200, passes=10
Evaluating model with:
num_topics=2, chucksize=200, passes=20
Evaluating model with:
num_topics=4, chucksize=100, passes=5
Evaluating model with:
num_topics=4, chucksize=100, passes=10
Evaluating model with:
num_topics=4, chucksize=100, passes=20
Evaluating model with:
num_topics=4, chucksize=200, passes=5
Evaluating model with:
num_topics=4, chucksize=200, passes=10
Evaluating model with:
num_topics=4, chucksize=200, passes=20
Evaluating model with:
num_topics=6, chucksize=100, passes=5
Evaluating model with:
num_topics=6, chucksize=100, passes=10
Evaluating model with:
num_topics=6, chucksize=100, passes=20
Evaluating model with:
num_topics=6, chucksize=200, passes=5
Evaluating mod

Print the top 5 models based on coherence score

In [56]:
results = sorted(results, key=lambda x: x[3], reverse=True)
for num_topics, chunksize_value, passes_value, coherence_score in results[:5]:
    print(f"Num Topics: {num_topics}, Chunksize: {chunksize_value}, Passes: {passes_value}, Coherence Score: {coherence_score}")

Num Topics: 12, Chunksize: 200, Passes: 5, Coherence Score: 0.5404271700143645
Num Topics: 12, Chunksize: 200, Passes: 20, Coherence Score: 0.5380960508582104
Num Topics: 14, Chunksize: 200, Passes: 10, Coherence Score: 0.5341124664987279
Num Topics: 12, Chunksize: 200, Passes: 10, Coherence Score: 0.5322717434467844
Num Topics: 14, Chunksize: 200, Passes: 20, Coherence Score: 0.5321588549299386


You can see the results in the files `gensim_res_2.txt` and `gensim_res_1.txt` in the results folder.

In the output above, the best model according to gensim's coherence score has 12 topics, a chunk size of 200 and 5 passes. Note that the cell below rebuilds the model with a chunk size of 100.
Here the evaluation metric is the coherence score, which is better suited to topic modeling than perplexity. A higher coherence score indicates a better model.

### Show topics for the best Gensim model

In [86]:
best_gensim_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=12,
    random_state=100,
    chunksize=100,
    passes=5,
    alpha="auto",
    eta="auto",
    per_word_topics=True
)

In [92]:
topics = best_gensim_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
    print("\n")

(0, '0.171*"la" + 0.084*"none" + 0.075*"idea" + 0.060*"comment" + 0.056*"mean" + 0.052*"run" + 0.017*"range" + 0.009*"fountain" + 0.008*"flaw" + 0.003*"sleeping"')


(1, '0.112*"orleans" + 0.085*"adult" + 0.040*"situation" + 0.039*"ceiling" + 0.034*"hurricane" + 0.034*"cold" + 0.033*"temperature" + 0.033*"conditioner" + 0.022*"odor" + 0.020*"period"')


(2, '0.097*"cut" + 0.009*"shock" + 0.000*"chopin" + 0.000*"gaucho" + 0.000*"bracelet" + 0.000*"coco" + 0.000*"repellent" + 0.000*"topless" + 0.000*"restuarants" + 0.000*"ceremony"')


(3, '0.056*"hotel" + 0.030*"restaurant" + 0.024*"place" + 0.022*"pool" + 0.019*"room" + 0.017*"lot" + 0.017*"area" + 0.016*"beach" + 0.015*"bar" + 0.014*"food"')


(4, '0.000*"houer" + 0.000*"contribution" + 0.000*"keen" + 0.000*"offeringsuggestions" + 0.000*"caulk" + 0.000*"rusty" + 0.000*"approximity" + 0.000*"marshal" + 0.000*"proprietor" + 0.000*"danger"')


(5, '0.095*"expectation" + 0.081*"reserve" + 0.047*"rude" + 0.046*"story" + 0.037*"dog" + 0.034