# **What is Topic Modeling?**
Topic modeling is an unsupervised machine learning technique used to identify and extract topics or themes from a collection of unstructured text data. It groups words that tend to occur together across documents into clusters, each of which represents an underlying topic in the text corpus.

Some common algorithms for topic modeling include:

* Latent Dirichlet Allocation (LDA): A probabilistic model that assumes each document is a mixture of topics and each topic is a mixture of words.

* Non-Negative Matrix Factorization (NMF): A linear algebra-based method that factorizes the document-term matrix into two non-negative matrices, one relating documents to topics and one relating topics to words.

* Latent Semantic Analysis (LSA): A method based on singular value decomposition (SVD) of the document-term matrix, used for text similarity and topic extraction.
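
To make the LDA description concrete: if each document is a mixture of topics and each topic is a distribution over words, then the probability of a word in a document is the topic-weighted sum of its per-topic probabilities. The sketch below uses invented numbers; the two topics, their word probabilities, and the 80/20 document mixture are illustrative assumptions, not the output of a trained model.

```python
# Toy illustration of LDA's mixture assumption (all numbers are invented).
# Each topic is a probability distribution over words.
topic_word = {
    "pets": {"dog": 0.5, "cat": 0.4, "python": 0.1},
    "tech": {"dog": 0.05, "cat": 0.05, "python": 0.9},
}

# Each document is a probability distribution over topics.
doc_topic = {"pets": 0.8, "tech": 0.2}


def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


for word in ("dog", "cat", "python"):
    print(word, round(word_probability(word, doc_topic, topic_word), 3))
```

Training an LDA model works in the opposite direction: it starts from the observed words and estimates the two sets of distributions that would best explain them.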

# **When and Where Do We Use Topic Modeling?**
Topic modeling is useful in any scenario where large volumes of unstructured text data need to be understood or analyzed. Some common applications include:

1. **Content Organization and Summarization**
   * **Use Case:** Automatically organizing articles, reports, or documents by their topics for easier navigation.
   * **Example:** Summarizing customer reviews to highlight common themes.
2. **Search Engine Optimization (SEO)**
   * **Use Case:** Improving the relevance of search results by identifying content topics.
   * **Example:** A search engine categorizing web pages by topic before ranking them.
3. **Customer Feedback Analysis**
   * **Use Case:** Analyzing feedback from surveys, social media, or reviews to understand customer sentiment and issues.
   * **Example:** Analyzing online reviews to discover common product complaints.
4. **Content Recommendation Systems**
   * **Use Case:** Suggesting articles, movies, or products based on user preferences.
   * **Example:** A streaming service recommending shows whose descriptions share topics with a user's viewing history.
5. **Academic and Legal Research**
   * **Use Case:** Extracting and summarizing key topics from large bodies of literature or legal documents.
   * **Example:** Identifying themes in historical archives or research papers.
6. **Market Research**
   * **Use Case:** Understanding public discussions about brands, products, or competitors by analyzing social media or news articles.
   * **Example:** Monitoring Twitter conversations to track trends and sentiments.
7. **Healthcare and Medical Research**
   * **Use Case:** Extracting topics from medical research papers or clinical notes.
   * **Example:** Identifying common side effects from patient reports.
8. **Political Analysis**
   * **Use Case:** Analyzing speeches, debates, or social media to understand political sentiment and key issues.
   * **Example:** Discovering election topics discussed by voters on social media.


# **Advantages**
* Provides a high-level overview of large text datasets.
* Helps identify hidden patterns in unstructured data.
* Facilitates decision-making by uncovering actionable insights.

# **Limitations**

* Results may not always align with human interpretation due to subjectivity in language.
* Requires large datasets for reliable output.
* Needs proper pre-processing, like removing stop words and stemming.

In [1]:
!pip install gensim



In [8]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim import corpora
from gensim.models import LdaModel

# Download the NLTK resources needed for tokenization and stop-word removal
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')


# **1. Load Your Data**
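We start with a small dataset containing the text we want to analyze. Six short sentences are enough to demonstrate the workflow, although a real analysis needs far more text.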

In [5]:
documents = [
    "The cat sat on the mat.",
    "Dogs are loyal and friendly animals.",
    "I love programming in Python.",
    "Artificial intelligence is the future of technology.",
    "Cats and dogs are common pets.",
    "Natural language processing is a branch of AI."
]

# **2. Preprocess the Text Data**
Preprocessing involves tokenizing, removing stop words, and other cleaning steps.

In [9]:
def preprocess_text(doc):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(doc.lower())  # Tokenize and convert to lowercase
    tokens = [word for word in tokens if word.isalpha() and word not in stop_words]  # Remove non-alphabetic tokens and stop words
    return tokens

# Preprocess each document
processed_docs = [preprocess_text(doc) for doc in documents]

# **3. Create a Dictionary and Corpus**

A dictionary maps each word to a unique ID, and the corpus represents the documents in a bag-of-words format.

In [10]:
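# Create a dictionary that maps each unique token to an integer ID
dictionary = corpora.Dictionary(processed_docs)
# Keep tokens that appear in at least 1 document and in no more than 50% of all documents
dictionary.filter_extremes(no_below=1, no_above=0.5)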

# Create a corpus
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
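
To see what the dictionary and the bag-of-words corpus contain, the same idea can be sketched in plain Python. This illustrates the concept only and is not gensim's implementation. The helper names `build_vocab` and `to_bow` are hypothetical, IDs are assigned in alphabetical order here (gensim's assignment may differ), and the token lists are what the preprocessing step would be expected to produce for three of the sample documents.

```python
from collections import Counter

# Token lists as the preprocessing step would be expected to produce them
# for three of the sample documents (assumed here, not recomputed).
docs = [
    ["cat", "sat", "mat"],
    ["dogs", "loyal", "friendly", "animals"],
    ["cats", "dogs", "common", "pets"],
]


def build_vocab(docs, no_below=1, no_above=0.5):
    """Map each kept token to an integer ID.

    A token is kept if it appears in at least `no_below` documents and in
    no more than `no_above` (a fraction) of all documents.
    """
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


vocab = build_vocab(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

In this three-document sketch, "dogs" occurs in two of three documents, which exceeds the 50% limit, so it is dropped. In the six-document corpus used in this tutorial the same word occurs in two of six documents and is kept, which shows how strongly the effect of `no_above` depends on corpus size.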

# **4. Build the LDA Model**

The LDA model identifies latent topics in the text.

In [11]:
# Train the LDA model
num_topics = 2  # Specify the number of topics
lda_model = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary, passes=10, random_state=42)

# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.113*"dogs" + 0.068*"friendly" + 0.068*"common" + 0.068*"cats" + 0.068*"loyal" + 0.068*"pets" + 0.068*"animals" + 0.068*"cat" + 0.068*"mat" + 0.068*"sat"
Topic 1: 0.065*"language" + 0.065*"branch" + 0.065*"processing" + 0.065*"ai" + 0.065*"natural" + 0.065*"technology" + 0.065*"artificial" + 0.065*"future" + 0.065*"intelligence" + 0.065*"python"
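
Each weight in the printed output is the probability of that word within the topic, rounded to three decimals. The string form is compact but awkward to post-process. As a minimal sketch, assuming the `weight*"word"` format shown above, a topic string can be parsed into (word, weight) pairs with a regular expression. The helper name `parse_topic` is hypothetical; when the model object is available, gensim's `lda_model.show_topic(topic_id)` returns such pairs directly and is the better choice.

```python
import re

# One of the topic strings printed above, shortened to its first three terms.
topic_str = '0.113*"dogs" + 0.068*"friendly" + 0.068*"common"'


def parse_topic(topic_str):
    """Turn '0.113*"dogs" + ...' into [('dogs', 0.113), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(topic_str)
print(terms)
```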


# **5. Visualize the Topics**

We will use pyLDAvis for an interactive visualization.
pyLDAvis is a Python library for interactive visualization of topic models such as Latent Dirichlet Allocation (LDA). It helps with interpreting the topics an LDA model produces by letting users explore the topics and the terms that characterize them. In particular, it:

* Visualizes topics as circles in a 2D plane, positioned with multidimensional scaling (MDS) so that similar topics appear closer together. The size of each circle reflects how prevalent the topic is in the corpus.

* Displays the most relevant terms for a selected topic, along with how frequently each term occurs within that topic.
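
The 2D layout depends on a distance between topics. By default pyLDAvis bases it on the Jensen-Shannon divergence between the topics' word distributions before projecting to two dimensions. The sketch below shows only that distance measure, on invented distributions; the function names and the three toy topics are assumptions for illustration, and the projection step is omitted.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and 0 when p equals q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented topic-word distributions over the same four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]  # similar to topic_a
topic_c = [0.05, 0.05, 0.20, 0.70]  # very different from topic_a

print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_c))
```

Topics with a small divergence, such as `topic_a` and `topic_b`, would be drawn close together, while `topic_c` would sit far from both.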

In [14]:
# pyLDAvis 3.3.1 pulls in the deprecated "sklearn" PyPI package and fails to build,
# so install an unpinned release instead (newer releases are expected to depend on scikit-learn).
!pip install pyLDAvis

import pyLDAvis
import pyLDAvis.gensim_models

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)  # Display the vis object directly within the notebook

# **6. Assign Topics to Documents**
We can infer the dominant topic for each document.

In [15]:
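def get_document_topics(corpus, lda_model):
    """Return the dominant (topic_id, probability) pair for each document."""
    topics = []
    for bow in corpus:
        # List of (topic_id, probability) pairs for this document
        topic_scores = lda_model.get_document_topics(bow)
        # Keep the pair with the highest probability
        dominant_topic = max(topic_scores, key=lambda x: x[1])
        topics.append(dominant_topic)
    return topics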

# Get topics for each document
document_topics = get_document_topics(corpus, lda_model)
print(document_topics)

[(0, 0.86966133), (0, 0.8963828), (1, 0.8691533), (1, 0.8953342), (0, 0.8963873), (1, 0.9128408)]
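
Each pair in this output is a (topic ID, probability) for the document at the same position in `documents`. As a small sketch of how to read it, the documents can be grouped by dominant topic. The values below are copied from the output above rather than recomputed, and the variable name `by_topic` is only illustrative.

```python
from collections import defaultdict

documents = [
    "The cat sat on the mat.",
    "Dogs are loyal and friendly animals.",
    "I love programming in Python.",
    "Artificial intelligence is the future of technology.",
    "Cats and dogs are common pets.",
    "Natural language processing is a branch of AI.",
]

# Dominant (topic_id, probability) pairs copied from the output above.
document_topics = [
    (0, 0.86966133), (0, 0.8963828), (1, 0.8691533),
    (1, 0.8953342), (0, 0.8963873), (1, 0.9128408),
]

# Group each document's text under its dominant topic.
by_topic = defaultdict(list)
for text, (topic_id, prob) in zip(documents, document_topics):
    by_topic[topic_id].append((text, prob))

for topic_id in sorted(by_topic):
    print(f"Topic {topic_id}:")
    for text, prob in by_topic[topic_id]:
        print(f"  {prob:.2f}  {text}")
```

With this grouping, topic 0 collects the three sentences about animals and topic 1 the three about programming and AI, which matches the top words printed for each topic earlier.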




# **Conclusion**
Topic modeling is a practical technique for uncovering hidden themes in text data. By following the steps in this tutorial, you can preprocess your data, build an LDA model, and extract insights such as the dominant topic of each document and its probability.
