# Topic Modeling
- Topic Modeling is an unsupervised machine learning technique used to discover hidden themes or topics in a collection of documents.
- It helps summarize, organize, and understand large sets of text data.

## Types of Topic Modeling:
1. LDA (Latent Dirichlet Allocation) - Probabilistic Approach
2. NMF (Non-negative Matrix Factorization) - Matrix factorization approach; the TF-IDF matrix is approximated by the product W*H (here W is the document-topic matrix and H is the topic-word matrix)

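* how *relevant* each topic is to each document is represented by "W" (one row per document, one column per topic)
* how *relevant* each word is to each topic is represented by "H" (one row per topic, one column per word)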
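
To make the shapes in W*H concrete, here is a minimal sketch with hand-written factors. The numbers are hypothetical and chosen only for illustration (3 documents, 2 topics, 4 words); a real NMF run would learn W and H from the TF-IDF matrix.

```python
import numpy as np

# Hypothetical toy factors: 3 documents, 2 topics, 4 vocabulary words.
W = np.array([
    [0.9, 0.0],   # document 0 is almost entirely topic 0
    [0.1, 0.8],   # document 1 is mostly topic 1
    [0.5, 0.5],   # document 2 mixes both topics
])
H = np.array([
    [1.0, 0.7, 0.0, 0.0],   # topic 0 puts weight on words 0 and 1
    [0.0, 0.0, 0.6, 1.0],   # topic 1 puts weight on words 2 and 3
])

# The product has one row per document and one column per word,
# the same shape as the document-word (TF-IDF) matrix it approximates.
V_approx = W @ H

print(W.shape, H.shape, V_approx.shape)
```

Because every entry of W and H is non-negative, each document row in the product is an additive combination of topic rows, which is what makes the factors readable as topics.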

# Steps in Topic Modeling Pipeline
1. Text Cleaning
    - Lowercasing
    - Removing Punctuation
    - Removing Stopwords
2. Vectorization
    - Use TF-IDF or CountVectorizer to convert text into vectors
3. Modeling
    - Apply NMF or LDA to extract topics
4. Topic Interpretation
    - Each topic = top N words
    - Each document = mix of topics

# Practical 10

## Topic Extraction - Extract the topics from the texts with the 

In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from nltk.corpus import stopwords

In [3]:
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hrk84ya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
documents = [
    "Machine learning is a method of data analysis that automates analytical model building.",
    "Artificial intelligence is intelligence demonstrated by machines, unlike natural intelligence.",
    "Deep learning is part of a broader family of machine learning methods based on artificial neural networks.",
    "Data science is an interdisciplinary field that uses scientific methods to extract insights from data.",
    "Python and R are commonly used programming languages in data science and analytics.",
    "Cloud computing enables on-demand access to a shared pool of configurable computing resources.",
    "Cybersecurity is the practice of protecting systems, networks, and programs from digital attacks.",
]

In [5]:
# step-1: text preprocessing function
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    text = " ".join([word for word in text.split() if word not in stop_words])
    return text

processed_docs = [preprocess_text(doc) for doc in documents]

In [6]:
# step-2: convert text into TF-IDF matrix
vectorizer = TfidfVectorizer(max_features=5000)
tfidf_matrix = vectorizer.fit_transform(processed_docs)

In [7]:
# step-3: apply NMF to extract topics
num_topics = 3
nmf_model = NMF(n_components=num_topics, random_state=42)
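# W: document-topic weights, shape (number of documents, num_topics)
W = nmf_model.fit_transform(tfidf_matrix)
# H: topic-word weights, shape (num_topics, vocabulary size)
H = nmf_model.components_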



In [12]:
# Step-4: Display the top words in each topic
feature_names = vectorizer.get_feature_names_out()

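def display_topics(topic_word_matrix, feature_names, num_words=5):
    # Each row of topic_word_matrix (H) holds one topic's weight for every word
    for topic_idx, topic in enumerate(topic_word_matrix):
        print(f"Topic {topic_idx+1}:")
        # argsort is ascending, so the highest-weighted word is printed last
        print(", ".join([feature_names[i] for i in topic.argsort()[-num_words:]]))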

display_topics(H, feature_names)

Topic 1:
used, analytics, commonly, science, data
Topic 2:
building, model, networks, machine, learning
Topic 3:
shared, cloud, pool, access, computing
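
The output above covers "each topic = top N words". The other half of topic interpretation, "each document = mix of topics", can be read from W. The sketch below uses a small hypothetical array in place of the real `W` (its values are made up, not taken from this run) to show one way of turning the weights into a per-document mix and a dominant topic.

```python
import numpy as np

# Hypothetical document-topic weights standing in for the W returned by
# nmf_model.fit_transform (4 documents, 3 topics); not the notebook's values.
W_example = np.array([
    [0.00, 0.60, 0.00],
    [0.45, 0.15, 0.00],
    [0.00, 0.00, 0.70],
    [0.20, 0.20, 0.00],
])

# NMF weights are non-negative but do not sum to 1, so normalize each row
# to read it as a topic mix. All-zero rows are left as zeros.
row_sums = W_example.sum(axis=1, keepdims=True)
topic_mix = np.divide(
    W_example, row_sums, out=np.zeros_like(W_example), where=row_sums > 0
)

# Dominant topic per document (1-based, to match the "Topic 1" labels).
dominant_topic = topic_mix.argmax(axis=1) + 1

for doc_idx, (mix, top) in enumerate(zip(topic_mix, dominant_topic)):
    print(f"Document {doc_idx + 1}: mix={mix.round(2)}, dominant topic={top}")
```

With the notebook's own variables, the same steps would apply to `W` directly. Note that `argmax` breaks ties by picking the first topic, as in the last row of the example.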
