# Topic Modeling

This notebook documents, step by step, different approaches to grouping documents into topics using **topic modeling**.

## What exactly is Topic Modeling?

**Topic Modeling** is a technique used to discover the **distribution of underlying topics** in a collection of documents.

Each topic is a collection of words that co-occur across a set of documents. The order of the words is not taken into account.
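
To make the "order is not taken into account" point concrete, here is a minimal sketch of a bag-of-words representation. The helper `bag_of_words` is hypothetical and exists only for this illustration, and splitting on whitespace is a simplifying assumption rather than how any particular library tokenizes text.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase, whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order
first = bag_of_words("solar power replaces fossil fuels")
second = bag_of_words("fossil fuels replaces solar power")

# Both texts collapse to the same bag of words
print(first == second)  # True
```

Because only the counts survive, a model built on this representation cannot tell the two sentences apart.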

## Two different Topic Modeling algorithms

We will test two different algorithms on the following set of documents.

In [2]:
# List of texts simulating documents' abstracts

docs = [
    'The universe is a vast expanse of space containing countless galaxies, stars, planets, and other celestial objects.',
    'Ancient civilizations such as the Egyptians, Greeks, and Romans have left behind rich legacies of art, architecture, and knowledge.',
    'Climate change is a pressing global issue that requires urgent action to mitigate its impacts on the environment and human societies.',
    'The rise of artificial intelligence has led to both excitement and concern about its potential to revolutionize industries and transform society.',
    'Renewable energy sources like solar and wind power are increasingly being adopted as alternatives to fossil fuels to combat climate change.',
    'The human brain is a complex organ responsible for controlling thoughts, emotions, movements, and bodily functions.',
    'Cultural diversity enriches societies by fostering tolerance, understanding, and appreciation of different traditions, languages, and perspectives.',
    'Globalization has connected people, cultures, and economies around the world, leading to both opportunities and challenges.',
    'The history of mathematics spans thousands of years and includes the development of algebra, geometry, calculus, and other branches.',
    'Artificial neural networks are computational models inspired by the structure and function of biological neural networks, used in various applications like image recognition and natural language processing.'
]

### Latent Dirichlet Allocation (LDA)

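LDA is one of the most popular topic modeling algorithms, in part because it is fast to train. It takes a bag-of-words representation of the corpus as input and estimates two sets of probability distributions: a distribution of topics for each document and a distribution of words for each topic.

Let's try **LDA** on the sample documents.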

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


class LDATopicModel:
    """
    A class to apply Latent Dirichlet Allocation (LDA) to a corpus of documents and extract topics.
    """
    def __init__(self, corpus: list[str], num_topics: int = 5):
        """
        Initialize the LDA topic model with the given corpus of documents and number of topics.
        """
        self.corpus = corpus
        self.num_topics = num_topics
        self.vectorizer: CountVectorizer = None
        self.model: LatentDirichletAllocation = None
        self.topics: list[list[str]] = None

    def fit(self):
        """
        Fit the LDA model to the corpus of documents and extract the top words for each topic.
        """
        self.vectorizer = CountVectorizer()
        X = self.vectorizer.fit_transform(self.corpus)

        self.model = LatentDirichletAllocation(n_components=self.num_topics, random_state=0)
        self.model.fit(X)

        self.topics = self.__get_topics()

    def predict(self, doc: str):
        """
        Predict the topic distribution for a new document.
        """
        X = self.vectorizer.transform([doc])
        return self.model.transform(X)[0]
    
    def __get_topics(self, n_words=5):
        """
        Get the top words for each topic in the LDA model.
        """
        feature_names = self.vectorizer.get_feature_names_out()
        topics = []
        for topic in self.model.components_:
            topics.append([feature_names[i] for i in topic.argsort()[:-n_words-1:-1]])
        return topics

In [None]:
# Create an instance of the LDATopicModel class and fit it to the corpus of documents
lda_model = LDATopicModel(docs, num_topics=3)
lda_model.fit()

# Print the generated topics
print(f'> Generated {len(lda_model.topics)} topics:')
for i, topic in enumerate(lda_model.topics):
    print(f'> Topic {i}: {topic}')

# Predict the topic distribution for a new document
lda_model.predict('The universe is a vast expanse of space containing countless galaxies, stars, planets, and other celestial objects.')

In [None]:
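# Create a standalone function that lists the top words of every topic
def generate_topics(
        count_vec: CountVectorizer,
        lda: LatentDirichletAllocation,
        n_words: int = 5) -> list[list[str]]:
    """
    Generate and return the top `n_words` words of each topic, given an already fitted CountVectorizer and LDA model.
    """
    result = []
    feature_names = count_vec.get_feature_names_out()
    for topic in lda.components_:
        # argsort is ascending, so walk backwards from the end to take the n_words largest weights
        result.append([feature_names[i] for i in topic.argsort()[:-n_words - 1:-1]])
    return result

# Generate the topics, reusing the vectorizer and model fitted inside lda_model
generate_topics(lda_model.vectorizer, lda_model.model)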

In [None]:
# Create a function that calculates the probability that a text belongs to each topic
def calculate_probabilities_for_each_topic(
        text: str,
        count_vec: CountVectorizer,
        lda: LatentDirichletAllocation) -> list[float]:
    """
    Calculate the probability that a text belongs to each topic.
    """
    bag_of_words = count_vec.transform([text])
    topic_dist = lda.transform(bag_of_words)
    # transform returns one row per input document; convert the only row to plain floats
    return topic_dist[0].tolist()

# Calculate the probability that the first document belongs to each topic
calculate_probabilities_for_each_topic(docs[0], lda_model.vectorizer, lda_model.model)
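
A per-topic probability list like the one computed above is often reduced to a single label by picking the topic with the highest probability. The sketch below shows one way to do that; the helper `most_likely_topic` and the example values are hypothetical and are not real model output.

```python
def most_likely_topic(probabilities: list[float]) -> tuple[int, float]:
    """Return the index and probability of the highest-scoring topic."""
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # On ties, max returns the first index that reaches the maximum
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best_index, probabilities[best_index]


# Hypothetical distribution over 3 topics (illustrative values only)
example_distribution = [0.05, 0.90, 0.05]
topic_index, probability = most_likely_topic(example_distribution)
print(f"Topic {topic_index} with probability {probability:.2f}")
```

Keeping the probability next to the index makes it easy to spot documents where no topic clearly dominates.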

### BERTopic


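This algorithm builds on **Bidirectional Encoder Representations from Transformers** (**BERT**).

Instead of relying on word counts alone, it represents each document with a transformer-based embedding, so it can also capture the context in which words appear.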