# **Topic Identification**
Topic identification is the task of automatically finding topics in a given text. It can be done in supervised and unsupervised ways. For example, an algorithm may label newspaper articles with known topics such as "sports," "politics," or "culture." In this case, we have predefined topics and labeled training data, so we can train the model in a supervised way. This is called topic classification. If we do not know the topics in advance and want the algorithm to discover them on its own, we are dealing with topic modeling or topic discovery, which is unsupervised [[1]](#scrollTo=1eUuDaNxZ_ms).


This notebook shows examples of unsupervised topic identification with Gensim’s LDA model.

## **Unsupervised topic modeling with Gensim’s LDA model**

Latent Dirichlet Allocation (LDA) is a common technique for unsupervised topic modeling. It is a generative probabilistic model that represents each document as a mixture of topics and each topic as a probability distribution over words. Dimensionality reduction of document vectors with singular value decomposition (SVD) belongs to a related technique, latent semantic analysis (LSA), rather than to LDA. Unsupervised topic modeling techniques are often used as a preprocessing step for supervised topic identification [[1]](#scrollTo=1eUuDaNxZ_ms).
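
The generative view behind LDA can be sketched in a few lines: to produce each word of a document, first pick a topic according to the document's topic mixture, then pick a word according to that topic's word distribution. The following is a minimal sketch of that story, not of Gensim's inference code. The two topics, their word probabilities, and the function name `generate_document` are made up for illustration.

```python
import random

# Hypothetical topics: each topic is a probability distribution over words.
TOPICS = {
    "space": {"black": 0.4, "hole": 0.4, "gravity": 0.2},
    "history": {"columbus": 0.5, "day": 0.3, "explorer": 0.2},
}

def generate_document(topic_mixture, num_words, rng):
    """Generate words by repeatedly sampling a topic, then a word from it."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(num_words):
        # Step 1: pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: pick a word according to that topic's word distribution.
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

rng = random.Random(0)
# A document that is 90% about "space" and 10% about "history".
print(generate_document({"space": 0.9, "history": 0.1}, num_words=8, rng=rng))
```

Training an LDA model runs this story in reverse: given only the words of each document, it estimates the topic mixtures and the word distributions that could plausibly have produced them.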

### Import the ``nltk`` library and download ``wordnet``
``nltk`` (Natural Language Toolkit) is an open source Python library for natural language processing. For more details about the ``nltk`` library, please refer to [[2]](https://www.nltk.org/api/nltk.html#nltk.wsd.lesk).

``wordnet`` is a lexical database that links words through semantic relations and is available in more than 200 languages [[3]](https://en.wikipedia.org/wiki/WordNet).

In [1]:
# Import the nltk module
import nltk

# Download the "wordnet" package by using the nltk module
nltk.download('wordnet')

# The class 'RegexpTokenizer' is used to split a string into substrings using regular expressions
from nltk.tokenize import RegexpTokenizer

# The class "WordNetLemmatizer" is used to lemmatize the words using WordNet's built-in morphy function
from nltk.stem.wordnet import WordNetLemmatizer

# Download 'omw-1.4' to use Multilingual Wordnet Data from OMW with newer Wordnet versions (December 2021 release)
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Import ``gensim``
``gensim`` is a Python library for topic modeling. It enables extraction of topics in an unsupervised way using LDA.
For more details about Gensim's LDA model, please refer to [[4]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html).

In [2]:
# Import gensim
from gensim import corpora, models
import gensim

### Create documents

In [3]:
# Create a sample document list containing two documents
doc_list = [
           "Black holes are dense points in space and they create deep gravity sinks. It is called black hole because beyond a certain region, not even light can escape the powerful tug of a black hole.",
           "The Italian explorer Christopher Columbus officially set foot in the America, and claimed the land for Spain in October 12, 1492. Americans celebrate Columbus Day as a national holiday every year since 1937. This day is celebrated as Columbus Day in the United States, but the name varies on the international spectrum."
           ]

### Tokenize documents

In [4]:
# Define a tokenizer to split each string into substrings
## '\w+' matches one or more word characters (letters, digits, or underscores)
tokenizer = RegexpTokenizer(r'\w+')

# Convert text to lowercase and tokenize
for idx in range(len(doc_list)):
    doc_list[idx] = doc_list[idx].lower()  
    doc_list[idx] = tokenizer.tokenize(doc_list[idx])

# Remove numbers, but not words that contain numbers
doc_list = [[token for token in doc if not token.isnumeric()] for doc in doc_list]

# Remove words that are shorter than three characters
doc_list = [[token for token in doc if len(token) > 2] for doc in doc_list]
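
To see what the preprocessing above does to a single sentence, here is a minimal sketch that mirrors the same steps with Python's standard `re` module instead of `nltk`. It assumes that `re.findall(r"\w+", ...)` is an adequate stand-in for `RegexpTokenizer(r'\w+')`. The helper name `preprocess` and the sample sentence are introduced only for this illustration.

```python
import re

def preprocess(text):
    """Lowercase, tokenize on word characters, drop numbers and short tokens."""
    # Lowercase the text and split it into runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Drop tokens that consist only of digits.
    tokens = [token for token in tokens if not token.isnumeric()]
    # Keep only tokens with more than two characters.
    tokens = [token for token in tokens if len(token) > 2]
    return tokens

print(preprocess("It is called a black hole; Columbus landed in 1492."))
# ['called', 'black', 'hole', 'columbus', 'landed']
```

Short function words such as "it", "is", and "in" disappear because of the length filter, while "1492" is removed by the numeric filter.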

### Create dictionary

In [5]:
# Create a dictionary representation of the document list
dictionary = corpora.Dictionary(doc_list)

### Vectorize documents and create corpus
We count how often each word occurs in each document and convert every document into a bag-of-words vector.

In [6]:
# Bag-of-words representation of the documents
corpus = [dictionary.doc2bow(doc) for doc in doc_list]

In [7]:
# Print the number of unique tokens and the number of documents
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 56
Number of documents: 2
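
A bag-of-words vector is a list of `(token_id, count)` pairs, one pair per distinct token in the document. The sketch below reproduces that idea in plain Python so that the output format of `doc2bow` is easier to read. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the way IDs are assigned here (alphabetical order over the whole corpus) is an illustrative choice that need not match the IDs Gensim assigns.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map every distinct token to an integer ID (alphabetical order)."""
    tokens = sorted({token for doc in docs for token in doc})
    return {token: idx for idx, token in enumerate(tokens)}

def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs for the tokens in doc."""
    counts = Counter(doc)
    return sorted((vocabulary[token], n) for token, n in counts.items())

docs = [["black", "hole", "black"], ["columbus", "day", "columbus", "day"]]
vocabulary = build_vocabulary(docs)
print(vocabulary)                   # {'black': 0, 'columbus': 1, 'day': 2, 'hole': 3}
print(to_bow(docs[0], vocabulary))  # [(0, 2), (3, 1)]
print(to_bow(docs[1], vocabulary))  # [(1, 2), (2, 2)]
```

Word order is discarded in this representation; only the identity of each token and its count survive.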


### Set training parameters and create model
In this section, we will set the following training parameters:
* ``num_topics``: The number of topics (number of dimensions), which can be chosen freely. For example, if we set this parameter to 10, we ask our unsupervised algorithm to describe our dataset with 10 topics, so each document is represented as a 10-dimensional vector of topic weights. In this example, we set ``num_topics = 2`` [[1]](#scrollTo=1eUuDaNxZ_ms).
* ``chunksize``: The number of documents processed at a time by the training algorithm. Increasing ``chunksize`` speeds up training, at least as long as the chunk of documents easily fits into memory. In this example, we set ``chunksize = 2``, which equals the number of documents, so all the data is processed in a single chunk. ``chunksize`` can influence the quality of the model [[4]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html).
* ``passes``: The number of times the model is trained on the entire corpus. Another word for passes might be "epochs". In this example, we set ``passes = 10``.
* ``iterations``: How often a particular loop is repeated over each document [[4]](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html). In this example, we set ``iterations = 400``.
* ``alpha``: The parameter of the prior distribution over topic weights in each document. In this example, we set ``alpha='auto'``, which lets the model learn this prior from the data.
* ``eta``: The parameter of the prior distribution over word weights in each topic. In this example, we set ``eta='auto'``, which lets the model learn this prior from the data.
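
To make the role of ``alpha`` more concrete: in LDA, the topic weights of each document are drawn from a Dirichlet prior, and ``alpha`` is the parameter of that prior. The sketch below is independent of Gensim and only illustrates the general effect, relying on the standard fact that normalized gamma draws follow a Dirichlet distribution. The helper `sample_dirichlet` and the specific alpha values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one topic-weight vector from a Dirichlet(alpha) distribution."""
    # Normalizing independent Gamma(alpha_i, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(42)

# Small alpha values tend to give documents dominated by a single topic.
for _ in range(3):
    print([round(w, 2) for w in sample_dirichlet([0.1, 0.1], rng)])

# Large alpha values tend to give documents that mix topics more evenly.
for _ in range(3):
    print([round(w, 2) for w in sample_dirichlet([10.0, 10.0], rng)])
```

The same reasoning applies to ``eta``, with words in a topic taking the place of topics in a document.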


In [8]:
# Set training parameters.
num_topics = 2
chunksize = 2
passes = 10
iterations = 400
alpha = 'auto'
eta = 'auto'

# Make an "index to word" dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

# Create topic model
model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha=alpha, 
    eta=eta,
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
)

### Print the top 10 topic words
In this section, we use the ``show_topics()`` method to list the 10 most probable words for each topic. The words are listed in descending order of their topic-specific probabilities [[5]](https://mimno.infosci.cornell.edu/papers/mimno-semantic-emnlp.pdf).

To calculate the topic-specific probability, the LDA model estimates a word distribution for each topic, in which words that appear frequently within the topic receive a high probability.

In [33]:
# Use the "show_topics()" method to list the probability score of each token
## "num_words" defines the number of words to be listed for each topic
## "formatted=True" returns each topic as a formatted string
## "formatted=False" returns each topic as a list of (word, probability) tuples
top_10_topic_words = model.show_topics(num_words=10, formatted=False)

# Print the top 10 words and their probability scores for each topic
from pprint import pprint
pprint(top_10_topic_words)

[(0,
  [('the', 0.0592251),
   ('black', 0.053588435),
   ('hole', 0.0377618),
   ('and', 0.032134674),
   ('can', 0.02175805),
   ('gravity', 0.021742798),
   ('even', 0.021730296),
   ('escape', 0.021726195),
   ('powerful', 0.021709928),
   ('create', 0.021709377)]),
 (1,
  [('the', 0.10010287),
   ('columbus', 0.0466471),
   ('day', 0.046641182),
   ('and', 0.028461445),
   ('explorer', 0.019063286),
   ('varies', 0.019056935),
   ('italian', 0.019042756),
   ('set', 0.019030467),
   ('land', 0.019028725),
   ('since', 0.019027684)])]
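
In the output above, frequent function words such as "the" and "and" rank highly in both topics, which makes the topics harder to tell apart. One common remedy is to drop such stop words before building the dictionary. The sketch below shows the idea with a tiny hand-written stop-word set. Both the set `STOP_WORDS` and the helper `remove_stop_words` are assumptions made for this example and are not part of the original notebook.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "are", "can", "for", "not", "but"}

def remove_stop_words(docs, stop_words=STOP_WORDS):
    """Return a copy of docs with all stop words filtered out."""
    return [[token for token in doc if token not in stop_words] for doc in docs]

docs = [["the", "black", "hole", "and", "gravity"], ["the", "columbus", "day"]]
print(remove_stop_words(docs))
# [['black', 'hole', 'gravity'], ['columbus', 'day']]
```

Applied to ``doc_list`` before the dictionary is created, a filter like this would keep such words out of the vocabulary, so they could no longer appear among the top topic words.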


### Print 3-word topic representations
In this section, we print the top 3 words which represent each topic.

In [29]:
# Print the top 3 word tokens representing each topic
for index, topic in model.show_topics(formatted=False, num_words=3):
    print('Topic: {} \nWords: {}'.format(index, ' '.join([w[0] for w in topic])))

Topic: 0 
Words: the black hole
Topic: 1 
Words: the columbus day


# **References**

- [1] Course Book "NLP and Computer Vision" (DLMAINLPCV01)
- [2] https://www.nltk.org/api/nltk.html#nltk.wsd.lesk
- [3] https://en.wikipedia.org/wiki/WordNet
- [4] https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html
- [5] https://mimno.infosci.cornell.edu/papers/mimno-semantic-emnlp.pdf

Copyright © 2022 IU International University of Applied Sciences