# Real-World Use Case: Automated Document Organization

## 1. The Problem
We have a dump of a few thousand news articles. We want to organize them into "Topics" (Sports, Tech, Politics, etc.) without reading or hand-labeling them.

## 2. The Unsupervised Pipeline (LSA)
Raw text is messy, so we run it through a fixed pipeline:
1.  **Vectorizer**: Turn each document into a vector of TF-IDF weights. This creates a sparse matrix with 10,000+ features (mostly zeros).
2.  **SVD (LSA)**: Dimensionality reduction that works directly on sparse matrices. Compresses the 10,000+ word features into 50 "Concepts".
3.  **K-Means**: Clusters the documents based on these concepts, after each document vector is scaled to unit length.
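To make the three stages concrete, here is a minimal sketch on a toy corpus. The four sentences are made up for illustration, and `n_components=2` and `n_clusters=2` are chosen only to fit such a tiny example; they are not the settings used for the real dataset.

```python
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two baseball sentences, two space sentences.
docs = [
    "the pitcher threw a fast ball in the ninth inning",
    "the batter hit a home run in the final inning",
    "the rocket launch put the satellite into orbit",
    "the shuttle reached orbit after the rocket launch",
]

# Stage 1: TF-IDF produces a sparse matrix (documents x vocabulary).
X_tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
density = X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])
print("TF-IDF:", type(X_tfidf).__name__, X_tfidf.shape, f"density={density:.2f}")

# Stage 2: TruncatedSVD accepts the sparse matrix and returns a small dense array.
X_lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_tfidf)
print("LSA:", type(X_lsa).__name__, X_lsa.shape)

# Stage 3: K-Means assigns one cluster id per document.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_lsa)
print("Cluster labels:", labels)
```

The point to notice is the change in shape and type between stages: a wide sparse matrix goes in, and a narrow dense array comes out of the SVD step before clustering.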

## 3. Data
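We use the 20 Newsgroups dataset, loaded with scikit-learn's `fetch_20newsgroups`. The category labels that ship with it are not used for clustering.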

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# 1. Load Data (just 4 categories to keep the example clear;
#    the dataset is downloaded and cached on the first call)
categories = ['rec.sport.baseball', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
print(f"Loaded {len(dataset.data)} documents.")

# 2. Build The LSA Pipeline
# Note: SVD output is not unit-length, and KMeans measures Euclidean distance.
# On unit-length vectors, Euclidean distance becomes a proxy for cosine similarity,
# so we add a Normalizer step before clustering.
lsa_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')),
    ('svd', TruncatedSVD(n_components=50, random_state=42)),
    ('normalizer', Normalizer(copy=False)), # Projects to unit sphere
    # n_init is set explicitly because its default differs across scikit-learn versions
    ('kmeans', KMeans(n_clusters=4, n_init=10, random_state=42))
])

# 3. Run it!
print("Running Analysis...")
lsa_pipeline.fit(dataset.data)
print("Done!")

# 4. Interpret Clusters
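# We need to reverse-engineer what the clusters mean.
# We map the centroids from SVD space back to TF-IDF space and see which words
# have the highest weights. (The Normalizer step has no inverse; it only rescales
# vectors, and rescaling a centroid does not change the ranking of its terms.)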

original_space_centroids = lsa_pipeline['svd'].inverse_transform(lsa_pipeline['kmeans'].cluster_centers_)
order_centroids = original_space_centroids.argsort()[:, ::-1]
terms = lsa_pipeline['tfidf'].get_feature_names_out()

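print("\nTop terms per cluster:")
# Read the cluster count from the fitted model instead of hard-coding 4.
for i in range(lsa_pipeline['kmeans'].n_clusters):
    top_terms = " ".join(terms[ind] for ind in order_centroids[i, :10])
    print(f"Cluster {i}: {top_terms}\n")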
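The `Normalizer` step in the pipeline above is described as a cosine-similarity proxy. A short numerical check shows why: for unit-length vectors `u` and `v`, the squared Euclidean distance equals `2 - 2 * cos(u, v)`, so ranking by Euclidean distance matches ranking by cosine similarity. The random vectors below are placeholders for illustration, not real document vectors.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two arbitrary 50-dimensional vectors standing in for LSA document vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(2, 50))

# Normalizer scales each row to unit L2 norm.
u, v = Normalizer().fit_transform(vectors)

cosine = float(u @ v)                           # cosine similarity of unit vectors
squared_euclidean = float(np.sum((u - v) ** 2))

# ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v = 2 - 2 * cosine
identity_holds = bool(np.isclose(squared_euclidean, 2 - 2 * cosine))
print("Identity holds:", identity_holds)
```

Because K-Means minimizes squared Euclidean distance to the centroids, running it on unit-length vectors approximates clustering by angle. The centroids themselves are generally not unit-length, which is why this is a proxy and not exact spherical K-Means.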