# Topic Modeling

This notebook documents, step by step, different approaches to grouping documents by topic using **topic modeling**.

## What exactly is Topic Modeling?

**Topic Modeling** is a technique used to discover the **distribution of underlying topics** in a collection of documents.

Each topic is a collection of words that co-occur across a set of documents. Word order is not taken into account.
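
To illustrate what ignoring word order means in practice, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` is a hypothetical name introduced for this illustration and is not part of the `topic_modeling` package used later in this notebook.

```python
from collections import Counter
import re


def bag_of_words(text):
    """Lowercase the text, extract alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two sentences with the same words in a different order
a = bag_of_words("The dog chased the cat")
b = bag_of_words("The cat chased the dog")

print(a)
print(a == b)  # True: both sentences yield identical word counts
```

Because both sentences map to the same counts, any model that only sees this representation cannot tell them apart.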

## Two different Topic Modeling algorithms

We will test two different algorithms on the following set of documents.

In [1]:
# List of texts simulating documents' abstracts

docs = [
    'The universe is a vast expanse of space containing countless galaxies, stars, planets, and other celestial objects.',
    'Ancient civilizations such as the Egyptians, Greeks, and Romans have left behind rich legacies of art, architecture, and knowledge.',
    'Climate change is a pressing global issue that requires urgent action to mitigate its impacts on the environment and human societies.',
    'The rise of artificial intelligence has led to both excitement and concern about its potential to revolutionize industries and transform society.',
    'Renewable energy sources like solar and wind power are increasingly being adopted as alternatives to fossil fuels to combat climate change.',
    'The human brain is a complex organ responsible for controlling thoughts, emotions, movements, and bodily functions.',
    'Cultural diversity enriches societies by fostering tolerance, understanding, and appreciation of different traditions, languages, and perspectives.',
    'Globalization has connected people, cultures, and economies around the world, leading to both opportunities and challenges.',
    'The history of mathematics spans thousands of years and includes the development of algebra, geometry, calculus, and other branches.',
    'Artificial neural networks are computational models inspired by the structure and function of biological neural networks, used in various applications like image recognition and natural language processing.'
]

In [2]:
# Add the parent directory to sys.path so that the topic_modeling package can be imported

import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))

### Latent Dirichlet Allocation (LDA)

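This algorithm is a popular choice, in part because of its speed. It models each document as a probability distribution over topics and each topic as a probability distribution over words, using a bag-of-words representation.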
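
To make the probability distributions concrete, the sketch below shows how, under the LDA view, the probability of a word in a document is a mixture: the sum over topics of (topic weight in the document) × (word probability in the topic). All numbers and names (`topic_word`, `doc_topic`, `word_probability`) are hypothetical toy values chosen for illustration; they are not the output of any fitted model.

```python
# Hypothetical toy values: each topic is a distribution over a 3-word vocabulary
topic_word = {
    "space":   {"galaxy": 0.5, "star": 0.4, "energy": 0.1},
    "climate": {"galaxy": 0.0, "star": 0.1, "energy": 0.9},
}

# Hypothetical toy values: one document as a distribution over the two topics
doc_topic = {"space": 0.8, "climate": 0.2}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(doc_topic[t] * topic_word[t].get(word, 0.0) for t in doc_topic)


vocabulary = ["galaxy", "star", "energy"]
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}

for word, p in doc_word.items():
    print(f"{word}: {p:.2f}")
```

Fitting an LDA model amounts to estimating both sets of distributions from the word counts of the corpus; this sketch only shows how they combine once known.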

Let's try **LDA**.

In [5]:
from topic_modeling.lda import LDA

# Create an LDA model, specifying the number of topics and the number of words to include in each topic
# Defaults:
# - num_topics = 3
# - num_words = 5
lda_model = LDA(corpus=docs, num_topics=3, num_words=5)

# Fit the model to the documents
lda_model.fit()

In [6]:
# Print the generated topics
print(f'\n> Generated {len(lda_model.topics)} topics:')
for i, topic in enumerate(lda_model.topics):
    print(f'> Topic {i}: {topic}')


> Generated 3 topics:
> Topic 0: ['and', 'of', 'the', 'networks', 'neural']
> Topic 1: ['and', 'to', 'the', 'its', 'has']
> Topic 2: ['of', 'the', 'and', 'other', 'is']


In [7]:
# Obtain the topic distribution for one of the documents
document = 'The universe is a vast expanse of space containing countless galaxies, stars, planets, and other celestial objects.'
probs = lda_model.predict(document)

# Print the topic distribution for the document
print('\n> Topic distribution for the new document:')
print(f'> Document: {document}')
for i, prob in enumerate(probs):
    print(f'> Topic {i}: {prob}')


> Topic distribution for the new document:
> Document: The universe is a vast expanse of space containing countless galaxies, stars, planets, and other celestial objects.
> Topic 0: 0.02021521411040529
> Topic 1: 0.02024417640495833
> Topic 2: 0.9595406094846364
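
A topic distribution like the one printed above can be reduced to a single label by taking the most probable topic. The following sketch copies the three printed values into a plain list, so it assumes nothing about the return type of `lda_model.predict`.

```python
# Values copied from the output above
probs = [0.02021521411040529, 0.02024417640495833, 0.9595406094846364]

# Index of the topic with the highest probability
dominant = max(range(len(probs)), key=lambda i: probs[i])

print(f"Dominant topic: {dominant} (p = {probs[dominant]:.3f})")
print(f"Sum of probabilities: {sum(probs):.6f}")
```

The probabilities sum to 1 (up to floating-point rounding), as expected for a distribution over topics, and the document is assigned almost entirely to Topic 2.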


In [8]:
# Print the coherence of the model
coherence = lda_model.calculate_coherence()
print(f'> Coherence of the model: {coherence}')

> Coherence of the model: -0.5426636917742381


### BERTopic


This algorithm is based on **B**idirectional **E**ncoder **R**epresentations from **T**ransformers (BERT).

Unlike bag-of-words approaches, it can also capture the context in which words appear.