# 📌 Topic: Latent Semantic Analysis (LSA)

### What you will learn
- What LSA is and why we use it to find "hidden" meanings
- Reducing dimensionality of text data to filter out noise
- Practical implementation using Gensim's `LsiModel` (LSI, Latent Semantic Indexing, is another name for LSA)
- How to group documents based on underlying concepts

### Why this matters
Synonyms are a big problem in NLP. A search for "cell phone" might miss a document about "smartphones" because the words are different. **Latent Semantic Analysis (LSA)** addresses this by looking at patterns of word co-occurrence. It identifies concepts (topics) rather than just individual words, which lets it find relationships between documents that don't share exact keywords.

---

## How does LSA work?

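LSA relies on a matrix factorization called **Singular Value Decomposition (SVD)**. It builds a term-document matrix (one row per word, one column per document, each cell a count or weight), factorizes it, and keeps only the components with the largest singular values. The result is a much lower-dimensional space.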

Imagine your dataset has 10,000 unique words. LSA might compress that representation down to just 100 "topics." This compression forces the model to ignore noise and focus on the most important relationships between words.
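To make the compression step concrete, here is a minimal NumPy sketch of a truncated SVD on a tiny term-document matrix. The vocabulary, the five documents, and the helper `cosine` are hypothetical and chosen only for illustration; they are not part of the sample corpus used in this notebook.

```python
import numpy as np

# Hypothetical toy vocabulary (assumption, for illustration only)
terms = ["phone", "screen", "smartphone", "battery", "garden", "flower", "soil"]

# Rows = terms, columns = documents, cells = raw counts
A = np.array([
    [1, 0, 1, 0, 0],  # phone
    [1, 0, 1, 0, 0],  # screen
    [0, 1, 1, 0, 0],  # smartphone
    [0, 1, 1, 0, 0],  # battery
    [0, 0, 0, 1, 1],  # garden
    [0, 0, 0, 1, 0],  # flower
    [0, 0, 0, 0, 1],  # soil
], dtype=float)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Full (thin) SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one k-dimensional row per document

raw_sim = cosine(A[:, 0], A[:, 1])              # documents 0 and 1, original space
lsa_sim = cosine(doc_vecs[0], doc_vecs[1])      # same pair, reduced space
lsa_cross = cosine(doc_vecs[0], doc_vecs[3])    # phone document vs garden document

print("raw cosine(doc0, doc1):", raw_sim)
print("LSA cosine(doc0, doc1):", lsa_sim)
print("LSA cosine(doc0, doc3):", lsa_cross)
```

In this toy matrix, documents 0 and 1 share no terms, so their raw cosine similarity is 0. After truncation to two dimensions they point in the same direction, because both co-occur with the terms of document 2, while the garden documents stay on a separate axis.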

In [None]:
import pandas as pd
from gensim import corpora
from gensim.models import LsiModel
from nltk.tokenize import word_tokenize
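import nltk

# word_tokenize needs the Punkt tokenizer data (newer NLTK releases look for "punkt_tab")
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)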

# Sample corpus on different themes
raw_docs = [
    "The cat and dog were playing in the garden.",
    "The kitten and puppy were having fun outdoors.",
    "I love working with natural language processing and Python.",
    "Data science and machine learning are exciting fields.",
    "NLP models require large amounts of training text."
]

# Basic preprocessing: lowercase and tokenize (stopwords and punctuation are kept)
tokenized_docs = [word_tokenize(doc.lower()) for doc in raw_docs]

# Create a dictionary and corpus (bag-of-words) for Gensim
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(text) for text in tokenized_docs]

## Step 1: Building the LSA Model

We use the `LsiModel` to reduce our corpus dimensionality. We'll ask it to find 2 main "topics" or concepts.

In [None]:
# Initialize LSA model with 2 topics
lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)

# Display the topics discovered
for idx, topic in lsi.print_topics(-1):
    print(f"Topic {idx}: {topic}")

## Step 2: Interpreting the Results

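Each topic is a weighted mix of words, and words that tend to appear in the same documents get large weights in the same topic. Ideally, animal words load on one topic and tech/NLP words on the other. On a corpus this small, with stopwords and punctuation still in the vocabulary, frequent tokens such as "the" and "and" can dominate the split. The magnitude of a weight shows how strongly a word contributes to a topic. The sign only matters relative to other words in the same topic: words with opposite signs pull a document in opposite directions along that concept, and SVD fixes each topic only up to an overall sign flip.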

## Key Takeaways

1.  **Noise Reduction**: By lowering dimensions, we discard the weakest components of the term-document matrix, which mostly reflect incidental word choice, and keep the core semantic structure.
2.  **Semantic Clustering**: LSA helps find similarity even when documents don't share identical words.
3.  **Speed**: LSA is based on linear algebra (a truncated SVD) and is usually faster to fit than probabilistic models like LDA, which rely on iterative inference.

## Next steps
- Explore **Topic Modeling (LDA)** to see a probabilistic approach to finding themes.
- Compare LSA results with word embeddings like Word2Vec.