In [10]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD, SparsePCA

In [4]:
## make vectors
df = pd.read_excel("data/sample.xlsx")
samples = df.title.values
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1,2))
X = vectorizer.fit_transform(samples)
X.shape

(269, 1312)

Truncated SVD works directly on term count or tf-idf matrices such as those returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA). Unlike PCA, it does not center the data before decomposing it, so it can work with scipy.sparse matrices efficiently.

In [8]:
svd = TruncatedSVD(n_components=100, n_iter=10, random_state=42)
X_tran = svd.fit_transform(X)
print(X_tran.shape)

(269, 100)
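
As a rough illustration of the LSA idea described above, the sketch below runs tf-idf followed by truncated SVD on a small made-up corpus and reports how much variance the kept components retain. The `toy_titles` list and the variable names are assumptions for illustration only; they are not part of the original data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus standing in for df.title.values
toy_titles = [
    "deep learning for image classification",
    "image segmentation with deep networks",
    "stock market price prediction",
    "predicting stock price movements",
    "neural networks for market forecasting",
    "classification of medical images",
]

# tf-idf returns a scipy.sparse matrix; TruncatedSVD accepts it as-is
tfidf = TfidfVectorizer(stop_words="english")
X_toy = tfidf.fit_transform(toy_titles)

# Project the documents onto 2 latent dimensions
lsa = TruncatedSVD(n_components=2, random_state=42)
X_lsa = lsa.fit_transform(X_toy)

# Share of the total variance retained by the kept components
explained = lsa.explained_variance_ratio_.sum()
print(X_lsa.shape)
print(round(float(explained), 3))
```

The same check on the real data would be `svd.explained_variance_ratio_.sum()`, which can help judge whether 100 components is a reasonable choice.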


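PCA centers the data and, in many scikit-learn versions, does not accept sparse input. Here we first reduce X with truncated SVD, which returns a dense matrix, and then apply SparsePCA to that result. Note that the "sparse" in SparsePCA refers to the components it learns (sparse loadings), not to sparse input. PCA is more commonly used in image processing.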

In [15]:
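spca = SparsePCA(n_components=10, random_state=0)
# Step 1: truncated SVD turns the sparse count matrix into a dense (269, 100) array
X_svd = svd.fit_transform(X)
# Step 2: SparsePCA reduces the dense array further to 10 components
X_tran = spca.fit_transform(X_svd)
print(X_tran.shape)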

(269, 10)
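
To make the difference between PCA and SparsePCA more concrete, here is a minimal sketch on random dense data (a hypothetical stand-in for `X_svd`) that compares the fraction of exactly-zero loadings in the learned components. The array shape and `alpha=1.0` are assumptions chosen for illustration; how many loadings SparsePCA actually zeroes out depends on `alpha` and on the data.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

# Hypothetical dense data standing in for the SVD-reduced matrix
rng = np.random.RandomState(0)
X_dense = rng.randn(60, 12)

# Ordinary PCA: components are generally dense (no exact zeros)
pca = PCA(n_components=3, random_state=0).fit(X_dense)

# SparsePCA: an L1 penalty (alpha) pushes some loadings toward exactly zero
spca_demo = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X_dense)

pca_zero_frac = float(np.mean(pca.components_ == 0))
spca_zero_frac = float(np.mean(spca_demo.components_ == 0))

print(pca.components_.shape, spca_demo.components_.shape)
print(pca_zero_frac, spca_zero_frac)
```

Raising `alpha` generally yields sparser components at the cost of reconstruction quality, so it is worth tuning rather than leaving at the default.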
