## Latent Dirichlet Allocation

In [2]:
from octis.models.LDA import LDA
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real

In [3]:
# Load the dataset
dataset = Dataset()  
dataset.fetch_dataset("20NewsGroup")
#dataset.load_custom_dataset_from_folder("data/processed") # Our custom preprocessed dataset


Make sure that the dataset is in the following format:

*   *corpus file*: a .tsv file (tab-separated) with up to three columns: the document, the partition, and, optionally, the label associated with the document.
*   *vocabulary*: a .txt file in which each line is one word of the vocabulary.

The partition can be "train" for the training partition, "test" for the testing partition, or "val" for the validation partition. An example dataset can be found here: sample_dataset.
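
As a rough illustration of this layout, the sketch below writes a tiny dataset folder using only the standard library. The documents, labels, and folder name are made up, and the file names `corpus.tsv` and `vocabulary.txt` (with no header row) are an assumption about what the custom loader expects, so check them against the OCTIS documentation before relying on them.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical toy documents: (preprocessed text, partition, optional label)
rows = [
    ("space shuttle launch orbit", "train", "sci.space"),
    ("encryption key chip privacy", "train", "sci.crypt"),
    ("orbit moon mission launch", "val", "sci.space"),
    ("key privacy government chip", "test", "sci.crypt"),
]

# Hypothetical output folder inside a temporary directory
folder = Path(tempfile.mkdtemp()) / "toy_dataset"
folder.mkdir()

# Corpus file: one document per line, tab-separated (assumed: no header row)
with open(folder / "corpus.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# Vocabulary file: one word per line
vocabulary = sorted({word for text, _, _ in rows for word in text.split()})
(folder / "vocabulary.txt").write_text("\n".join(vocabulary) + "\n", encoding="utf-8")

print((folder / "corpus.tsv").read_text(encoding="utf-8"))
```

A folder written this way could then be passed to `load_custom_dataset_from_folder`, as in the commented-out line of the cell above.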



In [4]:
# Create Model
model = LDA(num_topics=20, alpha=0.1)

In [None]:


# Define the search space. To see which hyperparameters can be optimized, check the topic model's initialization signature
search_space = {"alpha": Real(low=0.001, high=5.0), "eta": Real(low=0.001, high=5.0)}

# Metric to optimize. Assumption: NPMI coherence, the same measure used later in this notebook
eval_metric = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')

# Initialize an optimizer object and start the optimization.
optimizer = Optimizer()
optResult = optimizer.optimize(model, dataset, eval_metric, search_space,
                               save_path="../results",  # path to store the results
                               number_of_call=30,  # number of optimization iterations
                               model_runs=5)  # number of runs of the topic model
# Save the results of the optimization in a csv file
optResult.save_to_csv("results.csv")

Now we're ready to train the model. Note that the output of a topic model is a dictionary composed of 4 elements:


*   *topics*: the list of the top words of each topic
*   *topic-word-matrix*: the distribution over the words of the vocabulary for each topic (dimensions: |num topics| x |vocabulary|)
*   *topic-document-matrix*: the distribution over the topics for each document of the training set (dimensions: |num topics| x |training documents|)
*   *test-topic-document-matrix*: the distribution over the topics for each document of the testing set (dimensions: |num topics| x |test documents|)
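
To make these shapes concrete, here is a minimal sketch that builds a toy dictionary with the same layout from random NumPy arrays. Nothing here comes from a trained model: the vocabulary, the sizes, and the name `toy_output` are made up for illustration, and the top words are derived from the topic-word matrix by sorting each row.

```python
import numpy as np

# Hypothetical toy vocabulary and sizes
vocabulary = ["chip", "key", "launch", "orbit", "privacy", "space"]
num_topics, num_train_docs, top_k = 2, 5, 3
rng = np.random.default_rng(0)

# Each row is one topic's distribution over the vocabulary (rows sum to 1)
topic_word = rng.dirichlet(np.ones(len(vocabulary)), size=num_topics)

# Each column is one document's distribution over topics (columns sum to 1)
topic_document = rng.dirichlet(np.ones(num_topics), size=num_train_docs).T

# The top words of a topic are the highest-probability entries of its row
topics = [
    [vocabulary[i] for i in np.argsort(row)[::-1][:top_k]]
    for row in topic_word
]

toy_output = {
    "topics": topics,
    "topic-word-matrix": topic_word,
    "topic-document-matrix": topic_document,
}
for key, value in toy_output.items():
    print(key, np.shape(value))
```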



In [5]:
# Train the model using default partitioning choice
output = model.train_model(dataset)

print(*list(output.keys()), sep="\n") # Print the output identifiers



topic-word-matrix
topics
topic-document-matrix
test-topic-document-matrix


For example, here is a sample of 5 topics. Do you think they make sense?

In [6]:
for t in output['topics'][:5]:
  print(" ".join(t))

science theory evidence disease case make scientific fact green sound
key encryption clipper phone government chip law public system enforcement
widget application font window version problem machine find set library
make people time government find thing fire bit greek problem
information list include paper report file make mail article address


To check whether the topics are coherent, we can use a topic coherence measure. One of the most widely used is NPMI (normalized pointwise mutual information), which is available in OCTIS. We use the dataset itself as the reference corpus to compute it.

In [7]:
# Initialize metric
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
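
For intuition about what an NPMI-style score rewards, the sketch below computes NPMI for word pairs from document-level co-occurrence counts on a made-up corpus and averages it over all pairs of a topic's words. This is a simplification written for illustration; the library's `c_npmi` measure may differ in details such as how co-occurrence is counted and how zero counts are smoothed. The helper names `npmi` and `topic_npmi` are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical toy corpus: each document is a list of tokens
docs = [
    ["key", "encryption", "chip"],
    ["key", "encryption", "government"],
    ["space", "orbit", "launch"],
    ["space", "orbit", "key"],
]


def npmi(word_a, word_b, documents):
    """NPMI of two words, using document-level co-occurrence probabilities."""
    n = len(documents)
    p_a = sum(word_a in d for d in documents) / n
    p_b = sum(word_b in d for d in documents) / n
    p_ab = sum(word_a in d and word_b in d for d in documents) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lowest value by convention
    if p_ab == 1:
        return 1.0  # the words co-occur in every document; avoids dividing by zero
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(topic_words, documents):
    """Average NPMI over all unordered pairs of a topic's words."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


print(topic_npmi(["space", "orbit"], docs))
print(topic_npmi(["space", "encryption"], docs))
```

Words that tend to appear in the same documents score close to 1, while words that never appear together score -1, so a topic whose top words co-occur often gets a higher average.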

We can also test whether the resulting topics differ from each other. The `TopicDiversity` measure computes the proportion of unique words among the top words of the resulting topics, so a score close to 1 means the topics share few words.



In [8]:
# Initialize metric
topic_diversity = TopicDiversity(topk=10)
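
As a rough sketch of the idea behind this metric, the function below computes the fraction of unique words among the top words of all topics. It is written from scratch for illustration, assuming the score is a ratio between 0 and 1; the function name and the toy topics are hypothetical, and the library's own implementation should be treated as the reference.

```python
def topic_diversity_sketch(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


# Hypothetical toy topics: "key" appears in two of them
toy_topics = [
    ["key", "encryption", "chip"],
    ["space", "orbit", "launch"],
    ["key", "government", "law"],
]

print(topic_diversity_sketch(toy_topics, topk=3))
```

Here 8 of the 9 top words are distinct, so the toy score is 8/9; topics that repeat the same words would push it toward 0.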

With the `score` method, we can compute the actual evaluation scores. Just pass the output of the topic model to the method.

In [9]:
# Retrieve metrics score
topic_diversity_score = topic_diversity.score(output)
print("Topic diversity: "+str(topic_diversity_score))

npmi_score = npmi.score(output)
print("Coherence: "+str(npmi_score))

Topic diversity: 0.69
Coherence: 0.0669094052392269
