# OCTIS Models Evaluation

## Prologue & Imports

We evaluate the most relevant OCTIS models as a baseline for non-SOTA topic modeling. All models are compared on the same preprocessed dataset, with the same number of topics and the same evaluation metrics.

In [1]:
from octis.models.LSI import LSI
from octis.models.NMF import NMF
from octis.models.LDA import LDA
from octis.models.HDP import HDP
from octis.models.NeuralLDA import NeuralLDA
from octis.models.ProdLDA import ProdLDA
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.diversity_metrics import TopicDiversity, KLDivergence
from octis.evaluation_metrics.similarity_metrics import RBO, PairwiseJaccardSimilarity
from octis.evaluation_metrics.topic_significance_metrics import KL_uniform

from spacy.lang.el.stop_words import STOP_WORDS as el_stop
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

from utils.data_loader import GreekPMDataloader
from models.octis.utils.preprocessor_gr import GreekStanzaPreprocessor
from models.octis.config.preprocessing import preprocessor_gr_params
from models.octis.config.models import NUM_TOPICS, lsi_params, nmf_params, lda_params, hdp_params, neural_lda_params, prod_lda_params
from models.octis.config.optimization import OPTIMIZATION_RESULT_PATH, TOP_K, NUM_PROCESSES, MODEL_RUNS, search_space
from models.octis.utils.model_evaluator import OCTISModelEvaluator

import pandas as pd

2024-04-06 18:40:14 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



2024-04-06 18:40:14 INFO: Downloaded file to /Users/dion/stanza_resources/resources.json
2024-04-06 18:40:14 INFO: Loading these models for language: el (Greek):
| Processor | Package                 |
---------------------------------------
| tokenize  | gdt                     |
| mwt       | gdt                     |
| pos       | models/oct..._tagger.pt |
| lemma     | models/oct...matizer.pt |

2024-04-06 18:40:14 INFO: Using device: cpu
2024-04-06 18:40:14 INFO: Loading: tokenize
2024-04-06 18:40:15 INFO: Loading: mwt
2024-04-06 18:40:15 INFO: Loading: pos
2024-04-06 18:40:15 INFO: Loading: lemma
2024-04-06 18:40:15 INFO: Done loading processors!


## Dataset Loading

Our dataset has already been preprocessed in the `analysis` notebook, so we will load it directly.

In [2]:
dataset = Dataset()
dataset.load_custom_dataset_from_folder('models/octis/data/dataset')
print("Dataset found cached - loading...")

Dataset not found in cache - loading...
Preprocessing data...


  0%|          | 0/2033 [00:00<?, ?it/s]


Dataset preprocessed and saved!
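
The log messages above suggest a load-or-preprocess pattern around the dataset folder. The sketch below shows one way such a cache check could look. The function `load_or_build` and its `load`/`build` callables are hypothetical names introduced for illustration, not part of OCTIS or of this project.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(cache_dir, load: Callable[[Path], T], build: Callable[[Path], T]) -> T:
    """Return load(cache_dir) on a cache hit, otherwise build(cache_dir).

    Hypothetical helper: an existing, non-empty folder counts as a cache hit.
    """
    cache_dir = Path(cache_dir)
    if cache_dir.is_dir() and any(cache_dir.iterdir()):
        print("Dataset found cached - loading...")
        return load(cache_dir)
    print("Dataset not found in cache - preprocessing...")
    cache_dir.mkdir(parents=True, exist_ok=True)
    # build is expected to write its result into cache_dir before returning
    return build(cache_dir)
```

In this notebook, `load` could plausibly wrap `Dataset.load_custom_dataset_from_folder`, and `build` could run the Greek preprocessor and save the result to the same folder. That wiring is an assumption and is not shown here.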


In [3]:
corpus = dataset.get_corpus()

## Evaluation Metrics

In [4]:
coherence_npmi = Coherence(texts=corpus, topk=TOP_K, processes=NUM_PROCESSES, measure='c_npmi')
coherence_cv = Coherence(texts=corpus, topk=TOP_K, processes=NUM_PROCESSES, measure='c_v')
coherence_umass = Coherence(texts=corpus, topk=TOP_K, processes=NUM_PROCESSES, measure='u_mass')
coherence_uci = Coherence(texts=corpus, topk=TOP_K, processes=NUM_PROCESSES, measure='c_uci')

diversity_topic = TopicDiversity(topk=TOP_K)
diversity_kl = KLDivergence()

similarity_rbo = RBO(topk=TOP_K)
similarity_pjs = PairwiseJaccardSimilarity()

significance_kluni = KL_uniform()

other_metrics = [coherence_npmi, coherence_umass, coherence_uci, diversity_topic, diversity_kl, similarity_rbo, similarity_pjs, significance_kluni]
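
To make the coherence scores easier to read, here is a minimal sketch of NPMI coherence for a single topic. It is a simplification: probabilities are estimated from document-level co-occurrence, whereas the library implementation behind `Coherence(measure='c_npmi')` may use a different estimation scheme (for example a sliding window). The function name `npmi_coherence` and the `eps` parameter are illustrative assumptions.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents, eps=1e-12):
    """Mean NPMI over all word pairs of one topic (document-level co-occurrence)."""
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0:
        raise ValueError("need at least one document")

    def prob(*words):
        # Fraction of documents that contain every given word
        hits = sum(1 for d in doc_sets if all(w in d for w in words))
        return hits / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            # Words that never co-occur get the minimum NPMI by convention
            scores.append(-1.0)
            continue
        pmi = math.log(p12 / (p1 * p2))
        # eps avoids division by zero when the pair occurs in every document
        scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)
```

Values near 1 mean the topic's words almost always appear together, values near 0 mean they co-occur about as often as chance predicts, and values near -1 mean they rarely or never co-occur.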

In [5]:
metrics = {
    "coherence_npmi": coherence_npmi,
    "coherence_cv": coherence_cv,
    "coherence_umass": coherence_umass,
    "coherence_uci": coherence_uci,
    "diversity_topic": diversity_topic,
    "diversity_kl": diversity_kl,
    "similarity_rbo": similarity_rbo,
    "similarity_pjs": similarity_pjs,
    "significance_kluni": significance_kluni,
}
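
Two of the metrics above can be computed by hand from the top words of each topic. The sketch below assumes topic diversity is the share of unique words among the top-k words of all topics, and that pairwise Jaccard similarity is the mean Jaccard index over all topic pairs. The function names are illustrative and this is not the OCTIS source code.

```python
from itertools import combinations


def topic_diversity(topics, topk=10):
    """Share of unique words among the top-k words of all topics."""
    if any(len(t) < topk for t in topics):
        raise ValueError("every topic needs at least topk words")
    unique_words = {w for t in topics for w in t[:topk]}
    return len(unique_words) / (topk * len(topics))


def pairwise_jaccard(topics, topk=10):
    """Mean Jaccard index between the top-k word sets of every topic pair."""
    sets = [set(t[:topk]) for t in topics]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two topics")
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

A diversity close to 1 and a Jaccard similarity close to 0 both indicate that topics share few top words.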

## Model Initialization

In [6]:
lsi_model = LSI(**lsi_params)
lda_model = LDA(**lda_params)
hdp_model = HDP(**hdp_params)
nmf_model = NMF(**nmf_params)
neural_lda_model = NeuralLDA(**neural_lda_params)
prod_lda_model = ProdLDA(**prod_lda_params)

In [7]:
models = {
    "lsi": lsi_model,
    "lda": lda_model,
    "hdp": hdp_model,
    "nmf": nmf_model,
    "neural_lda": neural_lda_model,
    "prod_lda": prod_lda_model,
}

## Evaluation

In [8]:
evaluator = OCTISModelEvaluator(
    dataset=dataset,
    models=models,
    metrics=metrics,
    topics=NUM_TOPICS,
)
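
`OCTISModelEvaluator` is a project-specific class whose source is not shown here. As a rough sketch of what such an evaluator might do, the loop below trains each model once and scores its output with each metric. It relies only on the `train_model(dataset)` and `score(model_output)` methods that OCTIS models and metrics expose. The function name `evaluate_models` and the row layout are assumptions.

```python
def evaluate_models(models, metrics, dataset):
    """Train every model once and score its output with every metric.

    Hypothetical stand-in for the project's evaluator: returns one dict per
    model, ready to be turned into a table.
    """
    rows = []
    for model_name, model in models.items():
        # OCTIS models return a dict holding the topics and related matrices
        model_output = model.train_model(dataset)
        row = {"model": model_name}
        for metric_name, metric in metrics.items():
            row[metric_name] = metric.score(model_output)
        rows.append(row)
    return rows
```

The resulting list of rows can be passed to `pd.DataFrame(rows)` to obtain one row per model and one column per metric.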

In [9]:
evaluator.evaluate()

Epoch: [1/200]	Samples: [1439/287800]	Train Loss: 3575.99102023975	Time: 0:00:00.177247
Epoch: [1/200]	Samples: [160/32000]	Validation Loss: 1167837.0560546876	Time: 0:00:00.005994
Epoch: [2/200]	Samples: [2878/287800]	Train Loss: 3376.282645283183	Time: 0:00:00.120058
Epoch: [2/200]	Samples: [160/32000]	Validation Loss: 43017.15424804688	Time: 0:00:00.006124
Epoch: [3/200]	Samples: [4317/287800]	Train Loss: 3360.946257166435	Time: 0:00:00.126054
Epoch: [3/200]	Samples: [160/32000]	Validation Loss: 1322247197.6279786	Time: 0:00:00.006747
Epoch: [4/200]	Samples: [5756/287800]	Train Loss: 3290.535061240445	Time: 0:00:00.121633
Epoch: [4/200]	Samples: [160/32000]	Validation Loss: 35319.64255371094	Time: 0:00:00.005842
Epoch: [5/200]	Samples: [7195/287800]	Train Loss: 3251.2212473940235	Time: 0:00:00.130142
Epoch: [5/200]	Samples: [160/32000]	Validation Loss: 5965.650048828125	Time: 0:00:00.006255
Epoch: [6/200]	Samples: [8634/287800]	Train Loss: 3267.123164958304	Time: 0:00:00.129418
Epoc

  self.evaluation_df = pd.concat([self.evaluation_df, pd.DataFrame(model_metric_data)], ignore_index=True)
  divergence = np.sum(P*np.log(P/Q))
  divergence = np.sum(P*np.log(P/Q))
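
The two `np.log(P/Q)` lines above look like source lines echoed by runtime warnings. The empty `diversity_kl` and `significance_kluni` cells for `prod_lda` in the results are consistent with a KL computation that returned NaN, though the output does not confirm the cause. The sketch below shows a defensive KL divergence that validates its inputs and skips zero-probability terms. The function name and the `eps` floor are assumptions, not the library's implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two non-negative weight vectors of equal length.

    Both vectors are normalised to sum to 1. Terms with p_i == 0 contribute 0,
    and q_i is floored at eps so the logarithm stays finite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if any(x < 0 for x in p) or any(x < 0 for x in q):
        raise ValueError("weights must be non-negative")
    p_sum, q_sum = sum(p), sum(q)
    if p_sum <= 0 or q_sum <= 0:
        raise ValueError("weights must not sum to zero")

    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_sum, q_i / q_sum
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        total += p_i * math.log(p_i / max(q_i, eps))
    return total
```

Raising on negative weights makes the failure explicit instead of silently propagating NaN into the results table.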


| | model | coherence_npmi | coherence_cv | coherence_umass | coherence_uci | diversity_topic | diversity_kl | similarity_rbo | similarity_pjs | significance_kluni |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lsi | 0.016515 | 0.578025 | -1.510241 | -1.091888 | 0.566667 | 0.383912 | 0.054179 | 0.035555 | 0.190096 |
| 1 | lda | 0.13049 | 0.678527 | -1.265472 | 0.358107 | 0.826667 | 2.359925 | 0.012845 | 0.011574 | 1.560585 |
| 2 | hdp | -0.062434 | 0.490024 | -2.316952 | -3.00838 | 0.542667 | 0.361988 | 0.016444 | 0.013665 | 0.213347 |
| 3 | nmf | 0.07209 | 0.618584 | -1.250158 | -0.343318 | 0.606667 | 4.024918 | 0.041427 | 0.033569 | 2.102304 |
| 4 | neural_lda | -0.037495 | 0.505895 | -2.013145 | -1.39514 | 0.96 | 1.144312 | 0.001852 | 0.002355 | 0.681478 |
| 5 | prod_lda | -0.04074 | 0.618758 | -2.785769 | -3.756963 | 0.906667 | | 0.004984 | 0.00576 | |
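
One way to read these results is to pick the best model per metric. The sketch below copies four columns from the results above and assumes that higher is better for coherence and topic diversity, while lower is better for RBO because it measures overlap between topics. The helper `best_model` is an illustrative name.

```python
# Values copied from the evaluation results (subset of columns)
results = {
    "lsi": {"coherence_npmi": 0.016515, "coherence_cv": 0.578025,
            "diversity_topic": 0.566667, "similarity_rbo": 0.054179},
    "lda": {"coherence_npmi": 0.13049, "coherence_cv": 0.678527,
            "diversity_topic": 0.826667, "similarity_rbo": 0.012845},
    "hdp": {"coherence_npmi": -0.062434, "coherence_cv": 0.490024,
            "diversity_topic": 0.542667, "similarity_rbo": 0.016444},
    "nmf": {"coherence_npmi": 0.07209, "coherence_cv": 0.618584,
            "diversity_topic": 0.606667, "similarity_rbo": 0.041427},
    "neural_lda": {"coherence_npmi": -0.037495, "coherence_cv": 0.505895,
                   "diversity_topic": 0.96, "similarity_rbo": 0.001852},
    "prod_lda": {"coherence_npmi": -0.04074, "coherence_cv": 0.618758,
                 "diversity_topic": 0.906667, "similarity_rbo": 0.004984},
}

# Assumed direction of each metric (True means higher is better)
HIGHER_IS_BETTER = {
    "coherence_npmi": True,
    "coherence_cv": True,
    "diversity_topic": True,
    "similarity_rbo": False,
}


def best_model(results, metric, higher_is_better=True):
    """Return the name of the model with the best value for one metric."""
    pick = max if higher_is_better else min
    return pick(results, key=lambda name: results[name][metric])


best = {m: best_model(results, m, hib) for m, hib in HIGHER_IS_BETTER.items()}
print(best)
```

Under these assumptions, `lda` leads on both coherence measures, while `neural_lda` has the most diverse and least overlapping topics.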
