#### Definition:
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for topic modeling: it discovers the latent topics in a collection of documents. Each document is modeled as a mixture of topics, and each topic as a probability distribution over the words in the vocabulary.
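
The generative story behind this definition can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not how gensim fits the model; the vocabulary, the two topic-word distributions, and the Dirichlet prior `alpha=[0.5, 0.5]` are made-up values chosen for readability.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, vocab, alpha, length, rng):
    """Generate one document by following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["learning", "neural", "model", "language", "grammar", "syntax"]
# Hypothetical topic-word distributions (each row sums to 1).
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.4, 0.25, 0.25],
]
theta, words = generate_document(topics, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and topic-word distributions that plausibly produced them.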

#### Types:
1. Batch LDA: Updates the model only after a full pass over the entire dataset.
2. Online LDA: Updates the model incrementally from small chunks (mini-batches) of documents, which suits large or streaming datasets.
#### Use Cases:
1. Topic Discovery: Identifying the main topics in a collection of documents.
2. Document Classification: Classifying documents based on their topics.
3. Recommender Systems: Recommending content based on identified topics.
4. Information Retrieval: Enhancing search engines by indexing documents based on topics.
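
For the recommendation and retrieval use cases, one common approach is to compare documents by the distance between their topic mixtures. The sketch below uses the Hellinger distance in plain Python; the three topic mixtures are hypothetical values, not the output of a trained model, and `most_similar` is an illustrative helper rather than a library function.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

# Hypothetical per-document topic mixtures over three topics.
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.10, 0.85, 0.05],
    "doc_c": [0.70, 0.20, 0.10],
}

def most_similar(query_id, doc_topics):
    """Rank the other documents by topic-mixture distance to the query document."""
    query = doc_topics[query_id]
    others = [(doc_id, hellinger(query, mix))
              for doc_id, mix in doc_topics.items() if doc_id != query_id]
    return sorted(others, key=lambda pair: pair[1])

ranking = most_similar("doc_a", doc_topics)
print(ranking)
```

Here `doc_c` ranks ahead of `doc_b` for the query `doc_a`, because both `doc_a` and `doc_c` are dominated by the first topic.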
#### Short Implementation:
We will use the gensim library to implement LDA in Python.

#### Step-by-Step Implementation:
Install the necessary libraries:

pip install gensim nltk


In [None]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import nltk
from nltk.corpus import stopwords
import re

# Download stopwords
nltk.download('stopwords')

# Sample data
data = [
    'I love natural language processing.',
    'Machine learning is fascinating.',
    'Artificial intelligence is the future.',
    'NLP is a subset of AI.',
    'Deep learning is a branch of machine learning.'
]

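# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# Preprocessing function: lowercase and tokenize, then drop stop words and
# tokens of three characters or fewer (this also removes short terms such as "nlp")
def preprocess(text):
    return [token for token in simple_preprocess(text)
            if token not in stop_words and len(token) > 3]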

# Preprocess data
processed_data = [preprocess(doc) for doc in data]


#### Create the dictionary and corpus:

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(processed_data)

# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in processed_data]


#### Build the LDA model:

In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=2, 
                                            random_state=100,
                                            update_every=1,
                                            chunksize=10,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)


#### Print the topics:

In [None]:
# Print the topics
topics = lda_model.print_topics(num_words=3)
for topic in topics:
    print(topic)


#### Evaluate the model:

In [None]:
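# Compute the per-word likelihood bound; gensim derives perplexity as 2 ** (-bound)
bound = lda_model.log_perplexity(corpus)
print('\nPer-word likelihood bound: ', bound)
print('Perplexity: ', 2 ** (-bound))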

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=processed_data, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


#### Explanation:
1. Data Preprocessing: The text is lowercased and tokenized, then stop words and tokens of three characters or fewer are removed, leaving a list of content words for each document.
2. Dictionary and Corpus Creation: A dictionary (mapping of words to IDs) and a corpus (bag-of-words representation of the documents) are created.
3. Building the LDA Model: The LDA model is trained on the corpus to identify 2 topics. The parameters can be adjusted based on the dataset and requirements.
4. Printing Topics: The top words for each topic are printed.
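5. Model Evaluation: Perplexity and coherence are used to evaluate the topics. gensim's log_perplexity returns a per-word likelihood bound (higher, that is closer to zero, is better) rather than the perplexity itself, and a higher coherence score indicates more interpretable topics.

LDA is a powerful tool for uncovering hidden topics in text data, making it useful for a wide range of applications in NLP and text mining.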