# Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for text analysis that aims to discover hidden (latent) topics in a large collection of documents. It assumes that each document is generated from a mixture of topics and that each topic is a probability distribution over words. The basic idea is to represent the vocabulary structure of a document collection as a set of topics, and to infer these hidden topics from the distribution of words across the documents.

## The generative process of LDA

**1. Prior distributions**:
- The topic distribution of each document is drawn from a Dirichlet distribution.
- The word distribution of each topic is also drawn from a Dirichlet distribution.
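
In symbols, the two priors can be written as below. The notation is an assumption, since the text names no symbols: $\theta_d$ is the topic distribution of document $d$, $\varphi_k$ is the word distribution of topic $k$, $D$ and $K$ are the numbers of documents and topics, and $\alpha$, $\beta$ are the Dirichlet concentration parameters.

```latex
\theta_d \sim \operatorname{Dirichlet}(\alpha), \quad d = 1, \dots, D
\qquad
\varphi_k \sim \operatorname{Dirichlet}(\beta), \quad k = 1, \dots, K
```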

**2. Generation steps**:

- For each document:
  - Draw the document's topic distribution from the Dirichlet prior.
  - For each word position in the document:
    - Draw a topic from the document's topic distribution.
    - Draw a word from the word distribution of the selected topic.
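
The generative steps above can be sketched in plain Python. This is an illustration under stated assumptions, not how Gensim works internally: `sample_dirichlet` and `generate_corpus` are hypothetical helper names, the concentration values `alpha` and `beta` and the toy vocabulary are arbitrary, and every document has the same fixed length. Dirichlet samples are obtained by normalizing independent Gamma draws; with NumPy available, `numpy.random.Generator.dirichlet` would be the more usual choice.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate toy documents by following the LDA generative process."""
    rng = random.Random(seed)
    # Prior: one word distribution per topic (assumed symmetric Dirichlet).
    topic_word = [sample_dirichlet([beta] * len(vocab), rng)
                  for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # Prior: one topic distribution per document.
        doc_topic = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            # Draw a topic for this word position, then a word from that topic.
            topic = rng.choices(range(num_topics), weights=doc_topic)[0]
            word = rng.choices(vocab, weights=topic_word[topic])[0]
            words.append(word)
        docs.append(words)
    return docs


toy_vocab = ["graph", "trees", "minors", "user", "system", "response"]
toy_docs = generate_corpus(toy_vocab, num_topics=2, num_docs=3, doc_length=5)
for doc in toy_docs:
    print(doc)
```

Training an LDA model runs this process in reverse: given only the observed words, it estimates topic and word distributions that could plausibly have produced them.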

The following example trains an LDA model with Python and the Gensim library. Gensim is a natural language processing library for topic modeling that is designed to handle large text corpora.

In [4]:
# Gensim's dictionary utilities and its LDA implementation
import gensim.corpora as corpora
from gensim.models import LdaModel

# Hand-written list of common English stop words to filter out
stop_words = set([
    "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves",
    "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their",
    "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are",
    "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an",
    "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about",
    "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up",
    "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when",
    "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no",
    "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don",
    "should", "now"
])

# Nine short document titles used as a toy corpus
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

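# Tokenize: lowercase, split on whitespace, keep alphanumeric tokens, drop stop words
texts = [
    [word for word in doc.lower().split() if word.isalnum() and word not in stop_words]
    for doc in documents
]
print(texts)
print()

# Map each unique token to an integer id, then convert every document
# to a bag-of-words list of (token_id, count) pairs
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)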

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)], [(4, 1), (10, 1), (12, 1), (13, 1), (14, 1)], [(3, 1), (10, 2), (13, 1), (15, 1), (16, 1)], [(8, 1), (11, 1), (12, 1), (17, 1), (18, 1), (19, 1), (20, 1)], [(21, 1), (22, 1), (23, 1), (24, 1), (25, 1)], [(24, 1), (26, 1), (27, 1), (28, 1)], [(24, 1), (26, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1)], [(9, 1)
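
Each pair in the corpus output is `(token id, count)`. The sketch below re-creates that representation in plain Python to make it concrete. It is an illustration only, not Gensim's implementation: `build_vocab` and `to_bow` are hypothetical helper names, and the id assignment (new tokens numbered in sorted order as each document is processed) is an assumption. The ids differ from those printed above because the toy input contains only two of the documents.

```python
from collections import Counter


def build_vocab(tokenized_docs):
    """Assign an integer id to each token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


sample_texts = [
    ["eps", "user", "interface", "management", "system"],
    ["system", "human", "system", "engineering", "testing", "eps"],
]
vocab = build_vocab(sample_texts)
for doc in sample_texts:
    print(to_bow(doc, vocab))
```

The second document contains "system" twice, so its pair carries a count of 2, just as `(10, 2)` does in the fourth document of the output above.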

In [5]:
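# Train LDA with three topics; random_state fixes the seed for reproducibility
num_topics = 3
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)

# Print the ten highest-weighted words of each topic
for idx, topic in lda_model.print_topics(num_topics):
    print(f"Topic {idx}: {topic}")

# Infer the topic mixture of an unseen document; tokens that are not in the
# dictionary (here "interaction") are ignored by doc2bow
new_doc = "Human computer interaction"
new_bow = dictionary.doc2bow(new_doc.lower().split())
print("New document topic distribution:", lda_model.get_document_topics(new_bow))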

Topic 0: 0.058*"user" + 0.047*"response" + 0.045*"relation" + 0.044*"time" + 0.044*"measurement" + 0.043*"error" + 0.042*"perceived" + 0.039*"trees" + 0.038*"well" + 0.037*"widths"
Topic 1: 0.140*"system" + 0.072*"user" + 0.071*"eps" + 0.048*"response" + 0.048*"time" + 0.045*"survey" + 0.045*"computer" + 0.045*"human" + 0.044*"testing" + 0.044*"opinion"
Topic 2: 0.083*"graph" + 0.080*"trees" + 0.053*"minors" + 0.039*"survey" + 0.039*"binary" + 0.038*"generation" + 0.038*"intersection" + 0.038*"random" + 0.038*"unordered" + 0.038*"paths"
New document topic distribution: [(0, 0.11525135), (1, 0.7552332), (2, 0.12951544)]
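
The last line of output is a list of `(topic id, probability)` pairs. A small sketch, using the printed values as literal input, picks the most probable topic for the new document; `dominant_topic` is a hypothetical helper name, not part of Gensim.

```python
def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


# Values copied from the printed output above.
new_doc_topics = [(0, 0.11525135), (1, 0.7552332), (2, 0.12951544)]
topic_id, prob = dominant_topic(new_doc_topics)
print(f"Dominant topic: {topic_id} (p = {prob:.3f})")
```

Here the new document falls mostly under Topic 1, whose top words include "system", "computer", and "human". Note that `get_document_topics` leaves out topics whose probability is below its `minimum_probability` threshold, so the list does not always contain every topic.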
