<a href="https://colab.research.google.com/github/RajarajachozhanVK/RajarajachozhanVK/blob/main/Latent_Dirichlet_Allocation_(LDA)_with_Gibbs_sampling_algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ***Latent Dirichlet Allocation (LDA) with Gibbs sampling algorithm***

1. Introduction
Latent Dirichlet Allocation (LDA) is a popular probabilistic model used in natural language
processing and machine learning for topic modeling. Gibbs sampling is a common way to infer
its latent variables.

1.1 Notations and Definitions
* Documents: $D = \{d_1, d_2, \ldots, d_M\}$, where each $d_m$ represents a document.
* Words in Documents: $W = \{w_{m,n}\}$, where $w_{m,n}$ denotes the n-th word in the m-th document.
* Vocabulary: $V = \{v_1, v_2, \ldots, v_{|V|}\}$, the set of all unique words across all documents.
* Topics: $K$, the number of topics assumed in the corpus.
* Topic Distribution: $\theta_m$, the topic distribution for document $d_m$.
* Topic Assignment: $z_{m,n}$, the topic assignment for the n-th word in document $d_m$.
* Topic-Word Distribution: $\phi_k$, the distribution of words for topic $k$.
1.2 Model Assumptions
* Dirichlet Priors: LDA assumes Dirichlet priors for the topic distributions $\theta_m$ and the
topic-word distributions $\phi_k$.
* Exchangeability: Words are exchangeable within documents, meaning the order of words
does not affect the underlying topic distribution.
1.3 Generative Process
LDA posits the following generative process for each document $d_m$:
* Draw topic distribution $\theta_m \sim \text{Dirichlet}(\alpha)$.
* For each word $w_{m,n}$:
    * Draw topic assignment $z_{m,n} \sim \text{Multinomial}(\theta_m)$.
    * Draw word $w_{m,n}$ from $\phi_{z_{m,n}}$, the topic-word distribution for topic $z_{m,n}$.
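
The generative process can be simulated directly, which may help make the roles of the two distributions concrete. The sketch below is illustrative only: the function name `generate_corpus`, the fixed document length, and the seed are assumptions made for this example, and the topic-word distributions are assumed to be drawn once from a symmetric Dirichlet with parameter beta, following the Dirichlet prior stated in Section 1.2.

```python
import numpy as np

def generate_corpus(n_docs, doc_length, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process (illustrative)."""
    rng = np.random.default_rng(seed)
    # Assumed step: topic-word distributions phi_k ~ Dirichlet(beta), one row per topic
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, thetas = [], []
    for _ in range(n_docs):
        # Document-topic distribution theta_m ~ Dirichlet(alpha)
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Topic assignment z_{m,n} ~ Multinomial(theta_m) for every word position
        z = rng.choice(n_topics, size=doc_length, p=theta)
        # Word w_{m,n} drawn from phi_{z_{m,n}}
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in z]
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi

# Hypothetical sizes chosen only for illustration
docs, thetas, phi = generate_corpus(
    n_docs=3, doc_length=8, n_topics=2, vocab_size=6, alpha=0.1, beta=0.1
)
```

With small alpha, each sampled `thetas` row tends to put most of its mass on a single topic, so the words of a synthetic document mostly come from one row of `phi`.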
1.5 Inference using Gibbs Sampling
To infer the posterior distribution of the latent variables $Z$, $\Theta$, and $\Phi$, particularly
$\Theta$ and $\Phi$ for topic modeling with LDA, Gibbs sampling is often used.
Gibbs Sampling Steps:
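1. Initialize $\Theta$, $\Phi$, and $Z$.
2. Iterate through each word $w_{m,n}$ in each document $d_m$:
2.1 Exclude the current assignment of $w_{m,n}$ from the counts behind $\Theta$ and $\Phi$ and compute $P(z_{m,n} = k \mid Z_{\neg(m,n)}, W, \alpha, \beta)$, where $Z_{\neg(m,n)}$ denotes all other topic assignments.
2.2 Sample $z_{m,n}$ from the conditional distribution $P(z_{m,n} = k \mid Z_{\neg(m,n)}, W, \alpha, \beta)$.
3. Update $\Theta$ and $\Phi$ based on the new assignments of $Z$.
4. Repeat until convergence or after a sufficient number of iterations.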
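
For reference, when the topic and word distributions are integrated out (the collapsed form of the sampler, which is what the count-based update in the sampling loop of Section 2 corresponds to), the conditional in step 2.1 takes the standard form below. The count symbols are notation introduced here for illustration: $n_{m,k}$ is the number of words in document $d_m$ assigned to topic $k$, $n_{k,w}$ is the number of times word $w$ is assigned to topic $k$, $n_k$ is the total number of words assigned to topic $k$, and the superscript $\neg(m,n)$ means the current word is excluded from the count.

$$
P(z_{m,n} = k \mid Z_{\neg(m,n)}, W, \alpha, \beta) \;\propto\; \left(n_{m,k}^{\neg(m,n)} + \alpha\right) \cdot \frac{n_{k,w_{m,n}}^{\neg(m,n)} + \beta}{n_{k}^{\neg(m,n)} + |V|\,\beta}
$$

The first factor favors topics already common in the document; the second favors topics that already explain the word well.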
2. Procedure
Step 1. Imports and Hyperparameters
import numpy as np
from scipy.special import gammaln
# Hyperparameters
alpha = 0.1 # Dirichlet parameter for topic distribution
beta = 0.1 # Dirichlet parameter for word distribution
* numpy is imported for numerical operations, and scipy.special.gammaln is imported for
computing the logarithm of the gamma function.
* alpha and beta are the concentration parameters of the Dirichlet priors on the topic
distributions and the word distributions, respectively.
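
One place where `gammaln` is typically useful is in tracking the log joint probability of words and topic assignments across sweeps, as a rough convergence check. The sketch below assumes symmetric priors and the collapsed form of the model; the helper names `log_multivariate_beta` and `log_joint` are hypothetical, and the count matrices follow the shapes used later in this notebook (documents by topics, topics by vocabulary).

```python
import numpy as np
from scipy.special import gammaln

def log_multivariate_beta(x):
    """Log of the multivariate beta function B(x), taken along the last axis."""
    return np.sum(gammaln(x), axis=-1) - gammaln(np.sum(x, axis=-1))

def log_joint(doc_topic_counts, topic_word_counts, alpha, beta):
    """log p(W, Z | alpha, beta) under symmetric Dirichlet priors (collapsed form)."""
    n_topics = doc_topic_counts.shape[1]
    vocab_size = topic_word_counts.shape[1]
    # Topic-word term: sum over topics of log B(n_k + beta) - log B(beta)
    word_term = np.sum(
        log_multivariate_beta(topic_word_counts + beta)
        - log_multivariate_beta(np.full(vocab_size, beta))
    )
    # Document-topic term: sum over documents of log B(n_m + alpha) - log B(alpha)
    doc_term = np.sum(
        log_multivariate_beta(doc_topic_counts + alpha)
        - log_multivariate_beta(np.full(n_topics, alpha))
    )
    return word_term + doc_term

# Tiny check: one document with one word, two topics, two vocabulary entries.
# The joint probability is 0.5 * 0.5, so the result should equal log(0.25).
value = log_joint(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0], [0.0, 0.0]]), 0.1, 0.1)
```

Working in log space with `gammaln` avoids the overflow that evaluating gamma functions directly would cause for realistic counts.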

In [1]:
import numpy as np
from scipy.special import gammaln
import random
from collections import defaultdict
# Ensure reproducibility
np.random.seed(42)
random.seed(42)
# Hyperparameters
alpha = 0.1  # Dirichlet parameter for topic distribution
beta = 0.1   # Dirichlet parameter for word distribution

In [2]:
# Example documents
documents = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'a', 'test'],
    ['this', 'document', 'is', 'a', 'test']
]
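# Create vocabulary and mappings (sorted so word indices are stable across runs;
# iteration order of a set of strings can change between Python sessions)
vocab = sorted(set(word for doc in documents for word in doc))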
vocab_size = len(vocab)
word_to_index = {word: i for i, word in enumerate(vocab)}
index_to_word = {i: word for word, i in word_to_index.items()}
# Convert documents to word indices
docs_ids = [[word_to_index[word] for word in doc] for doc in documents]

In [3]:
n_topics = 2
n_docs = len(documents)
doc_lengths = np.array([len(doc) for doc in documents])
# Initialize topic assignments randomly
topic_assignments = [[random.randint(0, n_topics - 1) for _ in doc] for doc in documents]
# Initialize count matrices
doc_topic_counts = np.zeros((n_docs, n_topics))  # Number of words assigned to each topic in each document
topic_word_counts = np.zeros((n_topics, vocab_size))  # Number of times each word is assigned to each topic
topic_counts = np.zeros(n_topics)  # Total number of words assigned to each topic
# Populate the count matrices
for d, doc in enumerate(docs_ids):
    for i, word in enumerate(doc):
        topic = topic_assignments[d][i]
        doc_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

In [4]:
num_samples = 1000
for _ in range(num_samples):
    for d, doc in enumerate(docs_ids):
        for i, word in enumerate(doc):
            current_topic = topic_assignments[d][i]
            # Decrement counts for the current word and topic
            doc_topic_counts[d][current_topic] -= 1
            topic_word_counts[current_topic][word] -= 1
            topic_counts[current_topic] -= 1
            # Calculate topic probabilities
            topic_probs = (doc_topic_counts[d] + alpha) * (topic_word_counts[:, word] + beta) / (topic_counts + vocab_size * beta)
            topic_probs /= np.sum(topic_probs)
            # Sample a new topic based on the probabilities
            new_topic = np.random.choice(np.arange(n_topics), p=topic_probs)
            topic_assignments[d][i] = new_topic
            # Increment counts for the new word and topic
            doc_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1

In [5]:
# Compute topic proportions for each document
doc_topic_probs = (doc_topic_counts + alpha) / (doc_lengths[:, None] + n_topics * alpha)

In [6]:
# Print topic proportions for each document
for d, doc in enumerate(documents):
    print(f"Document {d+1}: {doc}")
    for topic in range(n_topics):
        print(f"  Topic {topic}: {doc_topic_probs[d][topic]:.3f}")
    print()

Document 1: ['this', 'is', 'the', 'first', 'document']
  Topic 0: 0.981
  Topic 1: 0.019

Document 2: ['this', 'is', 'the', 'second', 'document']
  Topic 0: 0.981
  Topic 1: 0.019

Document 3: ['and', 'this', 'is', 'the', 'third', 'one']
  Topic 0: 0.500
  Topic 1: 0.500

Document 4: ['is', 'this', 'a', 'test']
  Topic 0: 0.976
  Topic 1: 0.024

Document 5: ['this', 'document', 'is', 'a', 'test']
  Topic 0: 0.981
  Topic 1: 0.019
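
The same smoothing used for the document-topic proportions can be applied to the topic-word counts to estimate the topic-word distributions and list the most probable words per topic. This is a minimal sketch under the assumption of a symmetric beta prior; the helper names `topic_word_distribution` and `top_words` and the toy counts are hypothetical and are not taken from the run above.

```python
import numpy as np

def topic_word_distribution(topic_word_counts, beta):
    """Smoothed estimate of phi from topic-word counts with a symmetric prior beta."""
    counts = np.asarray(topic_word_counts, dtype=float)
    vocab_size = counts.shape[1]
    return (counts + beta) / (counts.sum(axis=1, keepdims=True) + vocab_size * beta)

def top_words(phi, index_to_word, n=3):
    """Return the n highest-probability words for each topic."""
    return [
        [index_to_word[int(i)] for i in np.argsort(row)[::-1][:n]]
        for row in phi
    ]

# Toy counts for two topics over a three-word vocabulary (illustrative only)
toy_counts = np.array([[4, 1, 0], [0, 2, 5]])
toy_index_to_word = {0: 'apple', 1: 'banana', 2: 'cherry'}
toy_phi = topic_word_distribution(toy_counts, beta=0.1)
print(top_words(toy_phi, toy_index_to_word, n=2))
# prints [['apple', 'banana'], ['cherry', 'banana']]
```

In the notebook, the analogous call would pass `topic_word_counts`, `beta`, and `index_to_word` from the cells above.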

