#### SVD in NLP

In text mining, a collection of documents can be represented as a term-document matrix. 

- Each row represents a term (a word), and each column represents a document.
- An entry in the matrix might be the frequency of a term in a document.
- This is the foundation of Latent Semantic Analysis (LSA).
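
As a minimal sketch of such a matrix, the snippet below counts term frequencies with plain Python and NumPy. The three-document corpus and the helper name `build_term_document_matrix` are made up for illustration, and whitespace tokenization is an assumption; real pipelines usually do more preprocessing.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Return (sorted vocabulary, term-document count matrix)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenization (assumption)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    row = {term: i for i, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(tokenized):
        for tok in toks:
            A[row[tok], j] += 1  # frequency of the term in document j
    return vocab, A

# hypothetical toy corpus, not the one used later in this section
toy_docs = ["code code software", "software bug", "garden soil"]
vocab, A = build_term_document_matrix(toy_docs)

print(vocab)  # rows of A follow this order
print(A)      # columns of A follow the order of toy_docs
```

Rows follow the sorted vocabulary and columns follow the document order, so `A[vocab.index("code"), 0]` is the number of times "code" occurs in the first document.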
    
The SVD $A = U \Sigma V^T$ of this matrix has a direct interpretation:

- Original Data Matrix ($A$): A matrix where rows are terms and columns are documents.
- Left Singular Vectors (columns of $U$): Each vector captures a pattern of words that tend to occur together across documents. A vector with large-magnitude entries for terms like "programming," "software," and "code" would correspond to a "technology" topic.
- Right Singular Vectors (columns of $V$): Each vector captures how strongly every document expresses one topic. A document with a large entry in the "technology" singular vector is likely a document about technology.
- Singular Values ($\Sigma$): Each singular value measures the strength of its topic. The largest ones mark the dominant topics, while the smallest are associated with rare themes or noise.
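
One way to make the "one topic per singular value" reading concrete is to write the decomposition as a sum of rank-one pieces. The symbols $m$ (number of terms), $n$ (number of documents), and $r$ (rank of $A$) are notation introduced here, not taken from the text above.

$$
A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i \, u_i v_i^T,
\qquad A \in \mathbb{R}^{m \times n},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0
$$

Each term $\sigma_i u_i v_i^T$ can be read as one topic: $u_i$ says which terms belong to it, $v_i$ says which documents express it, and $\sigma_i$ weights its contribution to $A$.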

What happens with truncation?

- By performing a truncated SVD and keeping only the top k singular vectors, you reduce the dimensionality of the data. 
- The original term-document matrix, which might be sparse and noisy, is approximated by a lower-rank matrix that captures only the major underlying topics. 
- This compact representation is useful for tasks like: 
    - Topic discovery: The singular vectors directly reveal the main topics in the document collection.
    - Document similarity: Documents can be compared based on their scores for these main topics rather than individual word counts, revealing deeper semantic relationships. 
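
The low-rank approximation described above can be sketched in a few lines of NumPy. The matrix `A` below is a made-up example and `rank_k_approx` is a hypothetical helper name. The last two printed numbers should agree because of the Eckart-Young theorem: the Frobenius error of the best rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_k_approx(A, k):
    """Rank-k approximation of A via truncated SVD; also returns all singular values."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
    return A_k, S

# hypothetical 5-term x 4-document count matrix
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

k = 2
A_k, S = rank_k_approx(A, k)

err = np.linalg.norm(A - A_k, ord="fro")
discarded = float(np.sqrt((S[k:] ** 2).sum()))

print("rank of A_k:", np.linalg.matrix_rank(A_k))
print("approximation error:", round(float(err), 4))
print("sqrt of discarded sigma^2:", round(discarded, 4))
```

The approximation keeps the original shape of `A`, so it is a denoised version of the same term-document matrix rather than a smaller one; the compact representation comes from working with the k-dimensional factors directly.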

Example

This example runs Latent Semantic Analysis (LSA) on a tiny term-document matrix.

It shows how SVD discovers hidden topics and smooths over synonymy, even with only a handful of documents.

Mini corpus

- d1: “human computer interaction”
- d2: “user interface computer”
- d3: “graph theory trees”
- d4: “graph minors survey”

Terms (rows): human, user, computer, interface, graph, trees

Docs (cols): d1–d4

```python
import numpy as np

# term order: human, user, computer, interface, graph, trees
# docs: d1, d2, d3, d4
X = np.array([
    [1, 0, 0, 0],  # human
    [0, 1, 0, 0],  # user
    [1, 1, 0, 0],  # computer
    [1, 1, 0, 0],  # interface (d1 is also marked here, although its text reads "interaction")
    [0, 0, 1, 1],  # graph
    [0, 0, 1, 0],  # trees (singularizing to 'tree' not necessary for the demo)
], dtype=float)

# 1) SVD
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Keep k=2 latent dimensions
k = 2
Uk, Sk, Vtk = U[:, :k], np.diag(S[:k]), Vt[:k, :]

# 2) Low-rank embeddings
# term embeddings: Uk @ Sk   (6×2), one row per term
term_emb = Uk @ Sk

# doc embeddings:  Sk @ Vtk  (2×4), one column per document
doc_emb = Sk @ Vtk

# 3) Rank docs for a query: "human computer"
# Make a term vector in original space and project to latent space
terms = ["human", "user", "computer", "interface", "graph", "trees"]
q = np.zeros(len(terms))
for t in ["human", "computer"]:
    q[terms.index(t)] = 1.0

# project the query like a document column: q_k = q^T U_k
# (a column x_j of X maps to U_k^T x_j, which is exactly column j of doc_emb,
#  so no S_k^{-1} factor is needed when comparing against doc_emb)
q_k = q @ Uk  # shape: (2,)

# cosine similarity between query (2D) and each doc (2×4 columns)
def cos_sim(a, B):
    # a shape: (2,), B shape: (2,n)
    num = (a[:, None] * B).sum(axis=0)
    den = np.linalg.norm(a) * np.linalg.norm(B, axis=0)
    return num / (den + 1e-12)

sims = cos_sim(q_k.ravel(), doc_emb)
for i, s in enumerate(sims, start=1):
    print(f"d{i} similarity: {s:.3f}")

# Also show which terms are “near” which docs in 2D (optional)
print("\nTerm embeddings (2D):\n", term_emb.round(3))
print("\nDoc embeddings (2D):\n", doc_emb.round(3))
```


```text
d1 similarity: 1.000
d2 similarity: 1.000
d3 similarity: 0.000
d4 similarity: 0.000

Term embeddings (2D):
 [[-0.707  0.   ]
 [-0.707  0.   ]
 [-1.414  0.   ]
 [-1.414  0.   ]
 [ 0.    -1.376]
 [ 0.    -0.851]]

Doc embeddings (2D):
 [[-1.581 -1.581  0.     0.   ]
 [ 0.     0.    -1.376 -0.851]]
```
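
To see the synonymy smoothing in this example, one can compare terms instead of documents. The sketch below rebuilds the same matrix `X` and compares "human" and "user", which never appear in the same document. The helper `cosine` and the variable names are introduced here for illustration.

```python
import numpy as np

terms = ["human", "user", "computer", "interface", "graph", "trees"]
X = np.array([
    [1, 0, 0, 0],  # human
    [0, 1, 0, 0],  # user
    [1, 1, 0, 0],  # computer
    [1, 1, 0, 0],  # interface
    [0, 0, 1, 1],  # graph
    [0, 0, 1, 0],  # trees
], dtype=float)

def cosine(a, b):
    """Cosine similarity of two 1-D vectors (small epsilon avoids division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_emb = U[:, :k] @ np.diag(S[:k])  # one 2D row per term

h, u, g = (terms.index(t) for t in ["human", "user", "graph"])

raw_hu = cosine(X[h], X[u])                # original space: rows of X
lat_hu = cosine(term_emb[h], term_emb[u])  # latent space
lat_hg = cosine(term_emb[h], term_emb[g])

print(f"human vs user  (raw counts): {raw_hu:.3f}")
print(f"human vs user  (latent 2D):  {lat_hu:.3f}")
print(f"human vs graph (latent 2D):  {lat_hg:.3f}")
```

In the raw matrix the two terms are orthogonal because they share no document. Both co-occur with "computer" and "interface", though, so the rank-2 space places them on the same axis, while "graph" stays on the other one.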


```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
import numpy as np

docs = [
    "human computer interaction",
    "user interface for computer applications",
    "graph theory and tree structures",
    "survey of graph minors and applications"
]

# 1) TF-IDF -> 2) TruncatedSVD (k=2) -> 3) L2-normalize
vectorizer = TfidfVectorizer(min_df=1)
svd = TruncatedSVD(n_components=2, random_state=0)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(vectorizer, svd, normalizer)

doc_vecs = lsa.fit_transform(docs)  # shape: (n_docs, 2)

# Query and rank
query = "human computer"
q_vec = lsa.transform([query])      # shape: (1, 2)

def cosine_sim(a, B):
    # a: (1,d), B: (n,d); both are already L2-normalized by the pipeline,
    # so the dot product equals the cosine similarity
    return B @ a.ravel()

scores = cosine_sim(q_vec, doc_vecs)
order = np.argsort(-scores)         # indices from best to worst match

print("Query:", query)
for i in order:
    print(f"doc{i+1} score={scores[i]:.3f}  -> {docs[i]}")
```


```text
Query: human computer
doc1 score=1.000  -> human computer interaction
doc2 score=0.962  -> user interface for computer applications
doc4 score=0.063  -> survey of graph minors and applications
doc3 score=-0.214  -> graph theory and tree structures
```
