Evaluating the best LDA model from a hyperparameter search

We need
- The text (lemmatized), or any text
- The LDA model
- The corpus
- The id2word (can be generated on the fly)

In [12]:
import pandas as pd
import numpy as np
import pickle

import gensim

from pathlib import Path
from datetime import datetime
import json
import sys

In [2]:
%load_ext autoreload

In [3]:
sys.path.append('../')

%autoreload 2
from dataset_loader import GENRES, load_dataset

In [4]:
# constants

genre = GENRES.INDIE

Either we load the lemmatized (preprocessed) data for a quick analysis,

or we re-create the preprocessed data from scratch while keeping track of each document's ID.
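A minimal sketch of the second option, assuming the raw data is available as (doc_id, text) pairs. The function name `preprocess_with_ids`, the toy records, and the use of `str.split` in place of the real lemmatizer are all illustrative assumptions, not part of this notebook's pipeline. The point is that any document dropped during preprocessing shifts list positions, so the surviving IDs are recorded alongside the token lists.

```python
from typing import Callable, Iterable


def preprocess_with_ids(
    records: Iterable[tuple[int, str]],
    tokenize: Callable[[str], list[str]],
) -> tuple[list[int], list[list[str]]]:
    """Tokenize each (doc_id, text) record, keeping IDs aligned with the token lists."""
    kept_ids: list[int] = []
    kept_tokens: list[list[str]] = []
    for doc_id, text in records:
        tokens = tokenize(text)
        if not tokens:
            # Skipping empty documents is what shifts positions,
            # so only the survivors are recorded.
            continue
        kept_ids.append(doc_id)
        kept_tokens.append(tokens)
    return kept_ids, kept_tokens


# Hypothetical data: review IDs with raw text; str.split stands in for the lemmatizer.
records = [(101, "fun card game"), (102, ""), (103, "great puzzle quest")]
doc_ids, X_tokens = preprocess_with_ids(records, lambda text: text.lower().split())
```

With this layout, `doc_ids[i]` is the dataset ID of the document at position `i` in `X_tokens`, and therefore also in any corpus built from `X_tokens` in the same order.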

In [13]:
# Load the text

X_lemmatized_file = Path(f'lemmatized_data/{genre.value:02}_{str(genre)}.pkl')

if X_lemmatized_file.exists():
    with open(X_lemmatized_file, 'rb') as f:
        X_lemmatized = pickle.load(f)
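    print('Loaded X_lemmatized')
    print("X_lemmatized len:", len(X_lemmatized))
else:
    # Raising a bare string is a TypeError in Python 3, so raise a real exception.
    raise FileNotFoundError(f'{X_lemmatized_file} does not exist')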

Loaded X_lemmatized
X_lemmatized len: 719448


In [14]:
X_lemmatized[0]

['take',
 'one',
 'part',
 'faerie',
 'solitaire',
 'two',
 'part',
 'puzzle',
 'quest',
 'mix',
 'little',
 'poker',
 'yahtzee',
 'good',
 'measure',
 'get',
 'something',
 'like',
 'runespell',
 'overture',
 'changeling',
 'sort',
 'fight',
 'monster',
 'take',
 'quest',
 'exchange',
 'coin',
 'buff',
 'come',
 'form',
 'power',
 'card',
 'story',
 'strong',
 'element',
 'game',
 'like',
 'puzzle',
 'quest',
 'game',
 'battle',
 'determine',
 'play',
 'mini',
 'game',
 'instead',
 'match',
 'though',
 'game',
 'card',
 'game',
 'similar',
 'poker',
 'make',
 'certain',
 'combination',
 'card',
 'pair',
 'kind',
 'full',
 'house',
 'flush',
 'straight',
 'certain',
 'amount',
 'damage',
 'opponent',
 'try',
 'ability',
 'steal',
 'card',
 'opponent',
 'plus',
 'limited',
 'number',
 'move',
 'get',
 'per',
 'turn',
 'move',
 'card',
 'play',
 'power',
 'ups',
 'add',
 'enough',
 'strategy',
 'game',
 'keep',
 'interest',
 'admittedly',
 'game',
 'get',
 'bit',
 'repetitive',
 'find',


Load the best model from the search

In [6]:
# load the best model from training folder

training_datetime = datetime(2024, 2, 7, 18, 59, 39)

training_folder = Path(f'lda_multicore_grid_search_{training_datetime.strftime("%Y%m%d_%H%M%S")}')
training_result_json_path = training_folder.joinpath('result.json')
with open(training_result_json_path, 'r') as f:
    training_result = json.load(f)

best_model_checkpoint_path = Path(training_result['best_model_checkpoint'])

best_id2word = gensim.corpora.Dictionary.load(str(best_model_checkpoint_path.joinpath('lda_multicore.id2word')))
# best_corpus = [best_id2word.doc2bow(text) for text in X_lemmatized]      # recreate the corpus given the id2word (gensim Dictionary) (this is for new data)
best_corpus = gensim.corpora.MmCorpus(str(best_model_checkpoint_path.joinpath(f'{best_model_checkpoint_path.stem}_corpus.mm')))
best_model = gensim.models.ldamulticore.LdaMulticore.load(str(best_model_checkpoint_path.joinpath('lda_multicore')))

print('Best model checkpoint path:', best_model_checkpoint_path)

lda_model = best_model
id2word = best_id2word
corpus = best_corpus

Best model checkpoint path: lda_multicore_grid_search_20240207_185939/lda_multicore_lda_num_topics_20


Visualize the data

In [7]:
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, mds="mmds", R=10)
vis



Get the top 10 keywords for each topic

In [8]:
top_N_words = 10

for i, topic in lda_model.show_topics(num_topics=lda_model.num_topics, num_words=top_N_words, formatted=False):
    print(f'Topic {i}:')
    print(', '.join([word for word, _ in topic]))
    print()

Topic 0:
worth, buy, money, sale, pay, hour, fun, definitely, cheap, dollar

Topic 1:
like, feel, look, cat, kind, thing, think, sim, epic, similar

Topic 2:
indie, defense, rpg, fps, value, excellent, fan, title, genre, replay

Topic 3:
good, really, pretty, fun, nice, like, cool, play, little, bit

Topic 4:
play, free, old, recommend, amazing, year, highly, fun, addict, school

Topic 5:
like, end, feel, really, thing, look, say, think, play, character

Topic 6:
review, item, dungeon, new, write, shop, buy, sell, thing, loot

Topic 7:
fun, level, hard, challenge, simple, play, easy, platformer, difficult, fast

Topic 8:
ship, card, learn, star, win, ai, fly, curve, car, trading

Topic 9:
play, know, say, want, start, think, thing, buy, let, video

Topic 10:
bad, control, graphic, good, gameplay, physic, terrible, bore, ok, boring

Topic 11:
level, weapon, like, play, different, player, good, character, mode, upgrade

Topic 12:
world, character, experience, different, create, explore, 

Get the most representative docs

Ref: https://stackoverflow.com/questions/63777101/topic-wise-document-distribution-in-gensim-lda

In [15]:
# setup: create an empty list per topic to collect the docs.
# (print_topics() returns at most 20 topics by default, so size the list
#  from num_topics to stay correct for models with more topics)
docs_per_topic = [[] for _ in range(lda_model.num_topics)]

# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = lda_model.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))

In [21]:
print(len(docs_per_topic[1]))

357966


In [20]:
docs_per_topic[0][:10]

[(0, 0.043097343),
 (2, 0.1656353),
 (3, 0.034733508),
 (4, 0.012146002),
 (5, 0.049941197),
 (8, 0.2748885),
 (10, 0.07538592),
 (11, 0.025033046),
 (13, 0.082657784),
 (19, 0.38296267)]

In [22]:
for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
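If only the top few documents per topic are needed, sorting every per-topic list in full is more work than necessary. The sketch below is an alternative using `heapq.nlargest` from the standard library. The helper name `top_docs_per_topic` and the toy scores are illustrative assumptions; the input mimics the (topic_id, score) pairs returned per document above.

```python
import heapq
from collections import defaultdict


def top_docs_per_topic(doc_topic_scores, top_n):
    """Return {topic_id: [(doc_id, score), ...]} with the top_n highest-scoring docs."""
    per_topic = defaultdict(list)
    for doc_id, topics in enumerate(doc_topic_scores):
        for topic_id, score in topics:
            per_topic[topic_id].append((doc_id, score))
    # nlargest selects the best top_n pairs without sorting the whole list.
    return {
        topic_id: heapq.nlargest(top_n, pairs, key=lambda pair: pair[1])
        for topic_id, pairs in per_topic.items()
    }


# Made-up per-document topic distributions in the (topic_id, score) shape used above.
toy_scores = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.6), (1, 0.4)],
]
toy_top = top_docs_per_topic(toy_scores, top_n=2)
```

Each value in `toy_top` is ordered from highest to lowest score, matching the result of the full sort followed by slicing.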

In [26]:
top_N_docs = 10

for i in range(len(docs_per_topic)):
    print(docs_per_topic[i][:top_N_docs])

[(197643, 0.99840033), (565329, 0.99799997), (563260, 0.9968037), (566553, 0.9958515), (559193, 0.99568117), (562266, 0.99507236), (476357, 0.9519259), (388162, 0.92692024), (267577, 0.92691875), (277440, 0.9208323)]
[(37346, 0.90499544), (680139, 0.88124967), (510547, 0.8812455), (71068, 0.86428493), (39737, 0.86428446), (308754, 0.8642843), (54915, 0.8642838), (279282, 0.86427635), (614039, 0.86427265), (21095, 0.8642596)]
[(180894, 0.99810374), (180236, 0.997983), (182391, 0.99786514), (183268, 0.9955399), (181755, 0.9938709), (180884, 0.99344826), (180421, 0.9924575), (180872, 0.9920168), (184725, 0.9918803), (180342, 0.98782045)]
[(378674, 0.99915934), (154179, 0.9979657), (405401, 0.9828991), (21786, 0.9824074), (75024, 0.9797872), (361483, 0.97889215), (529969, 0.9521481), (284311, 0.9472221), (600484, 0.93213123), (349977, 0.9269201)]
[(478339, 0.9965579), (646966, 0.98999906), (409670, 0.9894017), (409830, 0.9866192), (409761, 0.9820747), (34955, 0.91361576), (42594, 0.8944437

In [27]:
# TODO: use the ID to retrieve the top docs, and copy them to a file for inspection
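One possible way to address this TODO is sketched below. It assumes that position `i` in the corpus corresponds to `X_lemmatized[i]`, which has not been verified in this notebook. The helper name `dump_top_docs`, the output file name, and the toy data are hypothetical.

```python
import tempfile
from pathlib import Path


def dump_top_docs(docs_per_topic, texts, out_path, top_n=10):
    """Write the top_n documents of every topic to a plain-text file."""
    lines = []
    for topic_id, doc_list in enumerate(docs_per_topic):
        lines.append(f"Topic {topic_id}")
        for doc_id, score in doc_list[:top_n]:
            # texts[doc_id] is a token list, so join it back into one readable line.
            lines.append(f"  [{doc_id}] {score:.4f}  {' '.join(texts[doc_id])}")
        lines.append("")
    Path(out_path).write_text("\n".join(lines), encoding="utf-8")


# Toy stand-ins for docs_per_topic (already sorted by score) and X_lemmatized.
toy_docs_per_topic = [[(1, 0.9), (0, 0.3)], [(0, 0.7)]]
toy_texts = [["fun", "card", "game"], ["worth", "buy", "sale"]]
out_file = Path(tempfile.gettempdir()) / "top_docs_example.txt"
dump_top_docs(toy_docs_per_topic, toy_texts, out_file, top_n=1)
```

In the notebook this would be called as `dump_top_docs(docs_per_topic, X_lemmatized, 'top_docs.txt', top_N_docs)`, provided the ordering assumption holds.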

Test the capability of LDA with LLM topic naming

But before that, we need a way to map each corpus ID back to the original document ID in the dataset, so that the LLM can refer to the original document when it is passed into the prompt.
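A rough sketch of what that mapping and prompt assembly could look like. Everything here is an assumption for illustration: `corpus_to_dataset_id` is a hypothetical list built during preprocessing (position in the corpus to dataset ID), `dataset_texts` is a hypothetical lookup of the original review text, and the prompt wording is only a placeholder.

```python
def build_topic_prompt(topic_id, keywords, top_docs, corpus_to_dataset_id, dataset_texts):
    """Assemble a plain-text prompt asking an LLM to name one topic."""
    lines = [
        f"Topic {topic_id} keywords: {', '.join(keywords)}",
        "Representative reviews:",
    ]
    for corpus_id, score in top_docs:
        # Translate the corpus position back into the dataset's own document ID.
        dataset_id = corpus_to_dataset_id[corpus_id]
        lines.append(f"- (id={dataset_id}, score={score:.2f}) {dataset_texts[dataset_id]}")
    lines.append("Suggest a short, descriptive name for this topic.")
    return "\n".join(lines)


# Hypothetical mapping and original texts; none of these IDs come from the real dataset.
corpus_to_dataset_id = [501, 507]
dataset_texts = {501: "Great price during the sale.", 507: "Worth every dollar."}
prompt = build_topic_prompt(
    0, ["worth", "buy", "sale"], [(1, 0.95), (0, 0.80)],
    corpus_to_dataset_id, dataset_texts,
)
```

Passing the original text rather than the lemmatized tokens is a design choice: the LLM is likely to name a topic more reliably from readable sentences than from a bag of lemmas.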