# **Topic Modeling**
Topic identification is the task of automatically finding topics in a given text. It can be done in a supervised or an unsupervised way. For example, an algorithm might label newspaper articles with known topics such as "sports," "politics," or "culture." In this case, the topics are predefined and labeled training data is available, so the model can be trained in a supervised way. This is called topic classification. If the topics are not known in advance and the algorithm has to find clusters of similar topics on its own, the task is called topic modeling or topic discovery, which is unsupervised [[1]](#scrollTo=1eUuDaNxZ_ms).


This notebook shows examples of unsupervised topic modeling with Gensim’s LDA model.

## **Unsupervised topic modeling with Gensim’s LDA model**

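Latent Dirichlet Allocation (LDA) is a common technique for unsupervised topic modeling. LDA is a generative probabilistic model: it treats each document as a mixture of topics and each topic as a probability distribution over words, and it infers both from the word counts of the documents. This differs from approaches that build vector representations of documents and then reduce their dimensionality with techniques such as singular value decomposition (SVD), which is the idea behind latent semantic analysis. Unsupervised topic modeling techniques are often used as a preprocessing step for supervised topic identification [[1]](#scrollTo=1eUuDaNxZ_ms).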
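
LDA is usually described through a generative story: every word of a document is produced by first picking a topic from the document's topic mixture and then picking a word from that topic's word distribution. Training works in the opposite direction and infers the topics and mixtures from the observed words. The following is a minimal sketch of the generative direction only, using the standard library. The topic names, word probabilities, and the document mixture are hypothetical values chosen for illustration, not results from this notebook.

```python
import random

# Hypothetical topic-word distributions; each topic's probabilities sum to 1
topic_words = {
    "space": {"black": 0.4, "hole": 0.3, "gravity": 0.3},
    "history": {"columbus": 0.5, "day": 0.3, "spain": 0.2},
}

# Hypothetical topic mixture of a single document
doc_topics = {"space": 0.8, "history": 0.2}


def generate_words(n_words, doc_topics, topic_words, rng):
    """Generate words by first picking a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic according to the document's topic mixture
        topic = rng.choices(list(doc_topics), weights=list(doc_topics.values()))[0]
        # Step 2: pick a word according to that topic's word distribution
        candidates = topic_words[topic]
        word = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        words.append(word)
    return words


rng = random.Random(42)
words = generate_words(10, doc_topics, topic_words, rng)
print(words)
```

With a mixture of 0.8 for "space", most generated words are expected to come from the "space" topic, although any single run can deviate.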

### Import the ``nltk`` library and download ``wordnet``
``nltk`` (Natural Language Toolkit) is an open-source Python library for natural language processing. For more details about the ``nltk`` library, please refer to [[2]](https://www.nltk.org/api/nltk.html#nltk.wsd.lesk).

``wordnet`` is a lexical database that links words through semantic relations and covers more than 200 languages [[3]](https://en.wikipedia.org/wiki/WordNet).

In [None]:
# Import the "nltk" module
import nltk

# Download the "wordnet" package by using the "nltk" module
nltk.download('wordnet')

# The class "RegexpTokenizer" is used to split a string into substrings using regular expressions
from nltk.tokenize import RegexpTokenizer

# The class "WordNetLemmatizer" is used to lemmatize words using WordNet's built-in morphy function
from nltk.stem.wordnet import WordNetLemmatizer

# Download 'omw-1.4' to use Multilingual Wordnet Data from OMW with newer Wordnet versions (December 2021 release)
nltk.download('omw-1.4')

# Import "pprint" from the Python standard library to print complex data structures in an easy-to-read format
## For more details about the difference between the print() and pprint() functions, please refer to [4]
from pprint import pprint

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Import ``gensim``
``gensim`` is a Python library for topic modeling. It enables extraction of topics in an unsupervised way using LDA.
For more details about Gensim's LDA model, please refer to [[5]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html).

In [None]:
# Import gensim
from gensim import corpora, models
import gensim

### Create documents

In [None]:
# Create a sample document list containing two documents
doc_list = [
           "Black holes are dense points in space and they create deep gravity sinks. It is called black hole because beyond a certain region, not even light can escape the powerful tug of a black hole.",
           "The Italian explorer Christopher Columbus officially set foot in the America, and claimed the land for Spain in October 12, 1492. Americans celebrate Columbus Day as a national holiday every year since 1937. This day is celebrated as Columbus Day in the United States, but the name varies on the international spectrum."
           ]

### Tokenize documents

In [None]:
# Define a tokenizer to split each string into substrings
## r'\w+' matches one or more word characters (letters, digits, and underscores)
tokenizer = RegexpTokenizer(r'\w+')

# Convert text to lowercase and tokenize
doc_list = [tokenizer.tokenize(doc.lower()) for doc in doc_list]

# Remove purely numeric tokens, but keep words that contain digits
doc_list = [[token for token in doc if not token.isnumeric()] for doc in doc_list]

# Remove tokens that are shorter than three characters
doc_list = [[token for token in doc if len(token) > 2] for doc in doc_list]

# Print tokens
pprint(doc_list)

[['black',
  'holes',
  'are',
  'dense',
  'points',
  'space',
  'and',
  'they',
  'create',
  'deep',
  'gravity',
  'sinks',
  'called',
  'black',
  'hole',
  'because',
  'beyond',
  'certain',
  'region',
  'not',
  'even',
  'light',
  'can',
  'escape',
  'the',
  'powerful',
  'tug',
  'black',
  'hole'],
 ['the',
  'italian',
  'explorer',
  'christopher',
  'columbus',
  'officially',
  'set',
  'foot',
  'the',
  'america',
  'and',
  'claimed',
  'the',
  'land',
  'for',
  'spain',
  'october',
  'americans',
  'celebrate',
  'columbus',
  'day',
  'national',
  'holiday',
  'every',
  'year',
  'since',
  'this',
  'day',
  'celebrated',
  'columbus',
  'day',
  'the',
  'united',
  'states',
  'but',
  'the',
  'name',
  'varies',
  'the',
  'international',
  'spectrum']]
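
The same preprocessing steps can also be sketched with the Python standard library alone, which may help to see what each filter does. This is a minimal sketch that assumes ``re.findall`` with the pattern ``\w+`` splits text the same way as the tokenizer above. The function name ``simple_tokenize`` and its ``min_length`` parameter are hypothetical and not part of NLTK.

```python
import re


def simple_tokenize(text, min_length=3):
    """Lowercase the text, split it into word tokens, and filter them."""
    # r'\w+' matches runs of letters, digits, and underscores
    tokens = re.findall(r'\w+', text.lower())
    # Drop purely numeric tokens such as "1492", but keep mixed ones such as "b2b"
    tokens = [token for token in tokens if not token.isnumeric()]
    # Drop tokens shorter than min_length characters
    return [token for token in tokens if len(token) >= min_length]


sample = "Columbus claimed the land for Spain in October 12, 1492."
tokens = simple_tokenize(sample)
print(tokens)
```

Note that the length filter removes two-character tokens such as "in" as well as single characters, which matches the token lists printed above.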


### Create dictionary
We use the ``Dictionary`` class to create a mapping between tokens and their integer IDs.

In [None]:
# Create a dictionary representation of "doc_list"
dictionary = corpora.Dictionary(doc_list)

# Print keys and values of the "dictionary"
for key, value in dictionary.items():
  print(key, " : ", value)


0  :  and
1  :  are
2  :  because
3  :  beyond
4  :  black
5  :  called
6  :  can
7  :  certain
8  :  create
9  :  deep
10  :  dense
11  :  escape
12  :  even
13  :  gravity
14  :  hole
15  :  holes
16  :  light
17  :  not
18  :  points
19  :  powerful
20  :  region
21  :  sinks
22  :  space
23  :  the
24  :  they
25  :  tug
26  :  america
27  :  americans
28  :  but
29  :  celebrate
30  :  celebrated
31  :  christopher
32  :  claimed
33  :  columbus
34  :  day
35  :  every
36  :  explorer
37  :  foot
38  :  for
39  :  holiday
40  :  international
41  :  italian
42  :  land
43  :  name
44  :  national
45  :  october
46  :  officially
47  :  set
48  :  since
49  :  spain
50  :  spectrum
51  :  states
52  :  this
53  :  united
54  :  varies
55  :  year
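
To make the mapping concrete, the following minimal sketch assigns IDs with plain Python. It assumes, based only on the output above, that tokens are numbered one document at a time and alphabetically within each document. This is an illustration of the idea, not a description of Gensim's internals, and the function name ``build_token_ids`` and the sample documents are hypothetical.

```python
def build_token_ids(documents):
    """Assign consecutive integer IDs to tokens, one document at a time."""
    token_to_id = {}
    for doc in documents:
        # Within a document, unseen tokens receive their IDs in alphabetical order
        for token in sorted(set(doc)):
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


# Hypothetical tokenized documents for illustration
docs = [["black", "hole", "black"], ["the", "black", "day"]]
token_to_id = build_token_ids(docs)
print(token_to_id)
```

A token that occurs in several documents, such as "black" here, keeps the ID it received the first time it was seen.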


### Create bag-of-words representation
We use the ``doc2bow()`` method of ``dictionary`` to convert each document into the bag-of-words format. The method counts how often each token occurs in a document and represents the document as a list of ``(token_id, count)`` pairs.

In [None]:
# Create bag-of-words representation of the documents 
corpus = [dictionary.doc2bow(doc) for doc in doc_list]

# Print "corpus"
pprint(corpus)

[[(0, 1),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 3),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 2),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1)],
 [(0, 1),
  (23, 6),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 1),
  (33, 3),
  (34, 3),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1),
  (52, 1),
  (53, 1),
  (54, 1),
  (55, 1)]]
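
The bag-of-words idea itself can be sketched with ``collections.Counter`` from the standard library. This is a minimal illustration under the assumption that tokens missing from the mapping are simply ignored. The function name ``to_bag_of_words``, the mapping, and the sample document are hypothetical.

```python
from collections import Counter


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for the tokens of one document."""
    # Count IDs instead of strings; tokens without an ID are skipped
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


# Hypothetical mapping and tokenized document for illustration
token_to_id = {"black": 0, "hole": 1, "light": 2}
doc = ["black", "hole", "black", "light", "black", "hole"]
bow = to_bag_of_words(doc, token_to_id)
print(bow)
```

The word order of the document is lost in this representation: only the token IDs and their counts remain.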


### Set training parameters and create model
In this section, we set the following training parameters:
* ``num_topics``: It represents the number of topics (number of dimensions) and can be freely chosen. In this example, we set ``num_topics = 2``, which asks the unsupervised algorithm to group our dataset into 2 topics [[1]](#scrollTo=1eUuDaNxZ_ms).
* ``chunksize``: It controls how many documents are processed at a time in the training algorithm. Increasing ``chunksize`` speeds up training, as long as a chunk of documents fits into memory. In this example, we set ``chunksize = 10``, which is more than the number of documents, so all documents are processed in a single chunk. ``chunksize`` can also influence the quality of the model [[5]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html).
* ``passes``: It controls how often the model is trained on the entire corpus. Another word for passes is "epochs". In this example, we set ``passes = 10``.
* ``iterations``: It controls how often a particular loop over each document is repeated [[5]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html). In this example, we set ``iterations = 400``.
* ``alpha``: It is a hyperparameter that controls the prior probability distribution over topic weights in each document. In this example, we set ``alpha='auto'``, which lets the model learn this prior from the corpus.
* ``eta``: It is a hyperparameter that controls the prior probability distribution over word weights in each topic. In this example, we set ``eta='auto'``, which lets the model learn this prior from the corpus.
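
The role of a Dirichlet prior such as ``alpha`` can be illustrated without Gensim. The following minimal sketch draws topic mixtures from a symmetric Dirichlet distribution by normalizing Gamma draws and compares a small and a large concentration value. The values 0.1 and 10.0, the number of topics, and the function names are hypothetical choices for illustration; the notebook itself uses ``alpha='auto'`` rather than a fixed value.

```python
import random


def sample_symmetric_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [draw / total for draw in draws]


def average_largest_weight(alpha, size=5, n_samples=500, seed=0):
    """Average the largest component over many samples as a rough concentration measure."""
    rng = random.Random(seed)
    samples = (sample_symmetric_dirichlet(alpha, size, rng) for _ in range(n_samples))
    return sum(max(sample) for sample in samples) / n_samples


concentrated = average_largest_weight(alpha=0.1)
spread_out = average_largest_weight(alpha=10.0)
print(f"alpha=0.1  -> average largest weight {concentrated:.2f}")
print(f"alpha=10.0 -> average largest weight {spread_out:.2f}")
```

A small value tends to produce mixtures dominated by a single topic, while a large value tends to produce mixtures in which all topics have similar weights. The same reasoning applies to ``eta`` and the word weights within a topic.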


In [None]:
# Set training parameters
num_topics = 2
chunksize = 10
passes = 10
iterations = 400
alpha='auto'
eta='auto'

# Create the ID-to-token mapping "id2word"
temp = dictionary[0]  # Accessing an item forces the dictionary to build its "id2token" mapping
id2word = dictionary.id2token

# Create topic model
model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha=alpha, 
    eta=eta,
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
)

### Print the top 10 word tokens
In this section, we use the ``show_topics()`` method to list the 10 most probable tokens of each topic. The tokens are listed in descending order of their topic-specific probabilities [[6]](https://mimno.infosci.cornell.edu/papers/mimno-semantic-emnlp.pdf). Gensim's LDA model uses the token frequency data of each document to estimate the topic-specific probabilities. Within each topic, the probabilities of all tokens in the vocabulary sum to 1, so the 10 values listed for a topic add up to less than 1.



In [None]:
# Use the "show_topics()" method to list the topic-specific probability of each token
## "num_words" defines the number of tokens to be listed for each topic
## Set "formatted=True" to list tokens and their topic-specific probability scores as a string
## Set "formatted=False" to list them as tuples
top_10_word_tokens = model.show_topics(num_words=10, formatted=False)

# Print the top 10 word tokens with their topic-specific probabilities
pprint(top_10_word_tokens)

[(0,
  [('the', 0.059344474),
   ('black', 0.05311735),
   ('hole', 0.03787877),
   ('and', 0.03203366),
   ('powerful', 0.021743223),
   ('tug', 0.021655953),
   ('can', 0.021655044),
   ('create', 0.021637557),
   ('space', 0.021615326),
   ('called', 0.021614939)]),
 (1,
  [('the', 0.09992744),
   ('day', 0.046697255),
   ('columbus', 0.046594404),
   ('and', 0.028399862),
   ('foot', 0.019078067),
   ('explorer', 0.019069921),
   ('national', 0.019069491),
   ('united', 0.019044688),
   ('officially', 0.019032152),
   ('states', 0.019023782)])]
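
Because each topic is a probability distribution over the whole vocabulary of 56 tokens, the 10 listed tokens cover only part of the probability mass. The following short calculation uses the values printed above for topic 0. Only the variable names are introduced here for illustration.

```python
# Probabilities of the 10 most probable tokens of topic 0, copied from the output above
topic_0_top_10 = [
    ('the', 0.059344474),
    ('black', 0.05311735),
    ('hole', 0.03787877),
    ('and', 0.03203366),
    ('powerful', 0.021743223),
    ('tug', 0.021655953),
    ('can', 0.021655044),
    ('create', 0.021637557),
    ('space', 0.021615326),
    ('called', 0.021614939),
]

# Sum the listed probabilities and derive the mass left for all other tokens
covered = sum(probability for _, probability in topic_0_top_10)
remaining = 1.0 - covered
print(f"Probability mass of the 10 listed tokens: {covered:.3f}")
print(f"Probability mass of all other tokens:     {remaining:.3f}")
```

The listed tokens account for roughly 0.31 of the probability mass of topic 0. The rest is spread over the other 46 tokens of the vocabulary.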


### Print a topic representation
In this section, we take the 3 tokens with the highest topic-specific probabilities in each topic and print them as that topic's representation.

In [None]:
# Print the top 3 word tokens representing each topic
for index, topic in model.show_topics(formatted=False, num_words=3):
    print('Topic: {} \nWords: {}'.format(index, ' '.join(word for word, _ in topic)))

Topic: 0 
Words: the black hole
Topic: 1 
Words: the day columbus


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://www.nltk.org/api/nltk.html#nltk.wsd.lesk
- [3] https://en.wikipedia.org/wiki/WordNet
- [4] https://docs.python.org/3/library/pprint.html
- [5] https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
- [6] https://mimno.infosci.cornell.edu/papers/mimno-semantic-emnlp.pdf

Copyright © 2022 IU International University of Applied Sciences