### TUTORIAL: Train LDA with OCTIS

Welcome! This tutorial shows you how to train a topic model using OCTIS (Optimizing and Comparing Topic Models Is Simple).

![](https://github.com/MIND-Lab/OCTIS/blob/master/logo.png?raw=true)

A topic model allows you to discover the latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out! It's very easy with OCTIS :)

Let's start! First, we need to install OCTIS. (We are going to use the library version of OCTIS, but you can also use it through its dashboard. See https://github.com/mind-Lab/octis for more details.)

In [None]:
!pip install octis

Collecting octis
  Downloading octis-1.14.0-py2.py3-none-any.whl (130 kB)
Installing collected packages: octis
Successfully installed octis-1.14.0


Let's import what we need. 

In [1]:
from octis.models.LDA import LDA
from octis.dataset.dataset import Dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence

We need some data to run a topic model. OCTIS provides 4 preprocessed datasets, so let's use one of them.

In [2]:
# Define dataset
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")

And now we need a model. We are going to use LDA because it is the best known, but OCTIS integrates 8 other topic models (including neural topic models!).

We are going to set the number of topics to 20 and the hyperparameter alpha to 0.1. If you have no idea how to set your hyperparameters, you should definitely use OCTIS's optimization module. See this other tutorial for the optimization of hyperparameters: (link)

In [3]:
# Create Model
model = LDA(num_topics=20, alpha=0.1)

Now we're ready to train it. Note that the output of a topic model is a dictionary with 4 entries:


*   *topics*: the list of the top words of each topic
*   *topic-word-matrix*: the distribution of the words of the vocabulary for each topic (dimensions: |num topics| x |vocabulary|)
*   *topic-document-matrix*: the distribution of the topics for each document of the training set (dimensions: |num topics| x |training documents|)
*   *test-topic-document-matrix*: the distribution of the topics for each document of the testing set (dimensions: |num topics| x |test documents|)



In [4]:
# Train the model using default partitioning choice 
output = model.train_model(dataset)

print(*list(output.keys()), sep="\n") # Print the output identifiers

topic-word-matrix
topics
topic-document-matrix
test-topic-document-matrix
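
To make the orientation of these matrices concrete, here is a minimal sketch on toy data. It does not call OCTIS: the vocabulary, the numbers, and the helper names `top_words` and `dominant_topic` are made up for illustration. It assumes that the words in *topics* are the highest-weight entries of each row of the topic-word matrix, and that each column of the topic-document matrix describes one document.

```python
# Toy stand-ins for the model output (made-up numbers, not OCTIS results).
vocabulary = ["gun", "law", "key", "chip", "space"]  # hypothetical vocabulary

# Topic-word matrix: one row per topic, one column per vocabulary word.
topic_word = [
    [0.40, 0.35, 0.10, 0.10, 0.05],  # topic 0
    [0.05, 0.05, 0.40, 0.35, 0.15],  # topic 1
]

# Topic-document matrix: one row per topic, one column per document.
topic_document = [
    [0.9, 0.2, 0.4],  # weight of topic 0 in documents 0, 1, 2
    [0.1, 0.8, 0.6],  # weight of topic 1 in documents 0, 1, 2
]

def top_words(row, vocab, k):
    """Return the k words with the highest weight in one topic row."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

def dominant_topic(matrix, doc_index):
    """Return the index of the topic with the largest weight for one document."""
    column = [row[doc_index] for row in matrix]
    return max(range(len(column)), key=lambda t: column[t])

topics = [top_words(row, vocabulary, k=2) for row in topic_word]
dominant = [dominant_topic(topic_document, d) for d in range(3)]

print(topics)    # top 2 words of each toy topic
print(dominant)  # most prominent toy topic of each document
```

With the real output, the same idea applies to `output['topic-word-matrix']` and `output['topic-document-matrix']`, where NumPy indexing such as `matrix[:, d]` would replace the list comprehension.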


In [27]:
from scipy import stats

#(output['topics'][0]) # Print the words inside the first topic
#type(output)
print(
'objects in the output: \n',
output.keys(), '\n', # Print the keys of the output dictionary
'topic contents: \n',
len(output['topics']), '\n', # Print the number of topics
len(output['topics'][0]), '\n', # Print the number of words in the first topic
output['topics'][0], '\n', # Print the words in the first topic
'topic word matrix content: \n',
type(output['topic-word-matrix']), '\n',
output['topic-word-matrix'].shape, '\n', # Print the shape of the topic-word matrix
output['topic-word-matrix'][0].shape, '\n', # Print the shape of the first topic-word vector
output['topic-word-matrix'][0], '\n', # Print the first topic-word vector
output['topic-word-matrix'][0][0:10], '\n', # Print the first 10 elements of the first topic-word vector
'topic-document-matrix', '\n',
type(output['topic-document-matrix']), '\n',
output['topic-document-matrix'].shape, '\n', # Print the shape of the topic-document matrix
output['topic-document-matrix'][0].shape, '\n', # Print the shape of the first topic-document vector
output['topic-document-matrix'][0][1:10], '\n', # Print elements 1 to 9 of the first topic-document vector
'test-topic-document-matrix', '\n',
type(output['test-topic-document-matrix']),
'\n',
output['test-topic-document-matrix'].shape, '\n', # Print the shape of the test topic-document matrix
output['test-topic-document-matrix'][0].shape, '\n', # Print the shape of the first test topic-document vector
stats.describe(output['test-topic-document-matrix'][0][1:100]), '\n', # Print summary statistics of elements 1 to 99 of that vector
)

objects in the output: 
 dict_keys(['topic-word-matrix', 'topics', 'topic-document-matrix', 'test-topic-document-matrix']) 
 topic contents: 
 20 
 10 
 ['chip', 'launch', 'encryption', 'key', 'clipper', 'system', 'phone', 'government', 'space', 'technology'] 
 topic word matrix content: 
 <class 'numpy.ndarray'> 
 (20, 1612) 
 (1612,) 
 [1.3149755e-04 7.0876325e-04 1.1687994e-03 ... 1.7015223e-06 1.2614160e-05
 2.1259667e-04] 
 [1.3149755e-04 7.0876325e-04 1.1687994e-03 1.9049755e-03 1.4142671e-03
 8.2880119e-04 2.8464169e-04 9.7310147e-04 1.1821539e-04 7.1063907e-05] 
 topic-document-matrix 
 <class 'numpy.ndarray'> 
 (20, 11415) 
 (11415,) 
 [0.01250086 0.00227321 0.01250148 0.0029421  0.00526474 0.00263269
 0.00088532 0.00384719 0.00169553] 
 test-topic-document-matrix 
 <class 'numpy.ndarray'> 
 (20, 2447) 
 (2447,) 
 DescribeResult(nobs=99, minmax=(0.0, 0.6736658215522766), mean=0.021899297767591596, variance=0.007183942437923547, skewness=5.506752069358881, kurtosis=35.32417930145

For example, here is a sample of 5 topics. Do you think they make sense?

In [5]:
for t in output['topics'][:5]:
  print(" ".join(t))

chip launch encryption key clipper system phone government space technology
agent batf audio cpu channel radio power switch define police
gun weapon firearm law criminal control issue police carry officer
drive keyboard speed car driver back work test month good
key bit block book number time serial battery encrypt temperature


To check whether the topics are coherent, we can use a topic coherence measure. The most widely used one is NPMI, and it is available in OCTIS. We are going to use the dataset itself to compute it.

In [36]:
# Initialize metric
npmi = Coherence(texts=dataset.get_corpus(), topk=10, measure='c_npmi')
type(npmi)

octis.evaluation_metrics.coherence_metrics.Coherence
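
For intuition about what NPMI measures, here is a simplified sketch on a toy corpus. It is not the computation OCTIS performs: the corpus and the helper names `npmi_pair` and `topic_npmi` are hypothetical, and word probabilities are estimated from whole-document co-occurrence, whereas the library's implementation may estimate them differently (for example with sliding windows). The score of a word pair is `log(P(a, b) / (P(a) P(b))) / -log P(a, b)`, and a topic's score is the average over its word pairs.

```python
import math
from itertools import combinations

# Toy corpus of tokenized documents (made up for illustration).
docs = [
    ["gun", "law", "police"],
    ["gun", "police", "officer"],
    ["key", "chip", "encryption"],
    ["key", "encryption", "law"],
]

def npmi_pair(word_a, word_b, documents):
    """NPMI of two words, using document-level co-occurrence counts."""
    n = len(documents)
    p_a = sum(word_a in d for d in documents) / n
    p_b = sum(word_b in d for d in documents) / n
    p_ab = sum(word_a in d and word_b in d for d in documents) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lowest possible value
    if p_ab == 1:
        return 1.0  # the words co-occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

def topic_npmi(topic_words, documents):
    """Average NPMI over all word pairs of one topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, documents) for a, b in pairs) / len(pairs)

always_together = topic_npmi(["gun", "police"], docs)
independent = topic_npmi(["gun", "law"], docs)
never_together = topic_npmi(["gun", "encryption"], docs)
print(always_together, independent, never_together)
```

On this toy corpus the three pairs cover the range of the measure: words that always appear together score 1, words that co-occur exactly as often as chance predicts score 0, and words that never co-occur score -1.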

In [38]:
def getattrnames(obj):
  """Return the public attribute names of obj (those not starting with '_')."""
  try:
    names = vars(obj).keys()  # instance attributes, when obj has a __dict__
  except TypeError:
    names = dir(obj)  # fall back for objects without a __dict__
  return [name for name in names if not name.startswith('_')]

getattrnames(npmi)

['topk', 'processes', 'measure']

Or we can test whether the resulting topics are different from each other. The `TopicDiversity` measure computes the proportion of unique words among the top words of the resulting topics.



In [39]:
# Initialize metric
topic_diversity = TopicDiversity(topk=10)
getattrnames(topic_diversity)

['topk']

And with the `score` method, we can compute the actual value of each metric. Just pass the output of the topic model to the method.

In [8]:
# Retrieve metrics score
topic_diversity_score = topic_diversity.score(output)
print("Topic diversity: "+str(topic_diversity_score))

npmi_score = npmi.score(output)
print("Coherence: "+str(npmi_score))

Topic diversity: 0.7
Coherence: 0.049008584754555204
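
To make the topic diversity score more concrete, here is a minimal sketch that recomputes it by hand on the five sample topics printed earlier. It assumes that topic diversity is the fraction of unique words among the top-k words of all topics, which is consistent with a score between 0 and 1 such as the one above; the function `topic_diversity` is a hypothetical re-implementation, not the OCTIS code.

```python
# The five sample topics printed earlier in this tutorial.
sample_topics = [
    "chip launch encryption key clipper system phone government space technology".split(),
    "agent batf audio cpu channel radio power switch define police".split(),
    "gun weapon firearm law criminal control issue police carry officer".split(),
    "drive keyboard speed car driver back work test month good".split(),
    "key bit block book number time serial battery encrypt temperature".split(),
]

def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)

diversity = topic_diversity(sample_topics)
print(diversity)  # 48 unique words out of 50
```

Only "key" and "police" are shared between two of these five topics, so 48 of the 50 top words are unique. The score reported by OCTIS is computed over all 20 topics, which is why it differs from the value for this sample.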
