You have a sparse feature matrix and want to reduce the dimensionality.

Use Truncated Singular Value Decomposition (TSVD):




In [5]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np
# Load the data
digits = datasets.load_digits()

In [6]:
# Standardize feature matrix
features = StandardScaler().fit_transform(digits.data)
# Make sparse matrix
features_sparse = csr_matrix(features)
# Create a TSVD
tsvd = TruncatedSVD(n_components=10)
# Conduct TSVD on sparse matrix
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
# Show results
print("Original number of features:", features_sparse.shape[1])
print("Reduced number of features:", features_sparse_tsvd.shape[1])


Original number of features: 64
Reduced number of features: 10


TSVD is similar to PCA; in fact, PCA often uses non-truncated
Singular Value Decomposition (SVD) in one of its steps. Given d features,
regular SVD creates factor matrices that are d × d, whereas TSVD keeps only
the n largest singular values and their associated vectors, where n is
specified in advance by a parameter, so the transformed data has n features.
The practical advantage of TSVD is that, unlike PCA, it does not center the
data first and can therefore work directly on sparse feature matrices.
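
To make the difference in shapes concrete, here is a minimal NumPy sketch that computes a regular SVD and then truncates it by hand. The random data, the variable names, and the choice of k = 3 are assumptions made for illustration only; this is not how TruncatedSVD is implemented internally.

```python
import numpy as np

# Hypothetical toy data: 100 observations, d = 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))

# Regular (thin) SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate: keep only the k largest singular values and their vectors
k = 3
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Reduced feature matrix with k columns (equivalent to X @ Vt_k.T)
X_reduced = U_k * s_k

print(Vt.shape)          # (8, 8)   right factor of the regular SVD is d x d
print(np.diag(s_k).shape)  # (3, 3) kept singular values form a k x k matrix
print(X_reduced.shape)   # (100, 3) transformed data has k features
```

Projecting X onto the k retained right singular vectors gives the same reduced matrix, which is why the transform step only needs those k vectors.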
One issue with TSVD is that, because its default solver relies on a random
number generator, the signs of the output can flip between fittings. An easy
workaround is to call fit only once per preprocessing pipeline, then call
transform as many times as needed.
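
The workaround can be sketched as follows: fit a single TruncatedSVD object once and reuse it for every transform. The toy matrices X_train and X_new are hypothetical stand-ins for real data. The sketch also fixes random_state, which is an additional option not mentioned above; it makes a refit on the same data reproducible.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse toy data (mostly zeros)
rng = np.random.default_rng(0)
X_train = csr_matrix(rng.poisson(0.3, size=(200, 30)).astype(float))
X_new = csr_matrix(rng.poisson(0.3, size=(50, 30)).astype(float))

# Fit once...
tsvd = TruncatedSVD(n_components=5, random_state=0)
tsvd.fit(X_train)

# ...then transform as often as needed with the same fitted object,
# so every output uses the same component signs
train_reduced = tsvd.transform(X_train)
new_reduced = tsvd.transform(X_new)

# With a fixed seed, refitting on the same data reproduces the components
tsvd_again = TruncatedSVD(n_components=5, random_state=0).fit(X_train)
same_components = np.allclose(tsvd.components_, tsvd_again.components_)

print(train_reduced.shape, new_reduced.shape, same_components)
```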
As with linear discriminant analysis, we have to specify the number of features
(components) we want returned. This is done with the n_components
parameter. A natural question is then: what is the optimum number of
components? One strategy is to include n_components as a hyperparameter to
optimize during model selection (i.e., choose the value for n_components that
produces the best trained model). Alternatively, because TSVD provides us with
the ratio of the original feature matrix’s variance explained by each component,
we can select the number of components that explain a desired amount of
variance (95% or 99% are common values). For example, in our solution the
first three components explain approximately 30% of the original
data’s variance:

In [7]:
# Sum of first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()


0.3003938539127293
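
The first strategy, treating n_components as a hyperparameter, might look like the sketch below. The logistic regression classifier, the candidate values in the grid, and the three-fold cross-validation are arbitrary choices made for illustration, not recommendations from this recipe.

```python
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the same digits data used in the solution
digits = datasets.load_digits()

# Preprocessing and model in one pipeline so TSVD is refit inside each fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tsvd", TruncatedSVD(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in model (assumption)
])

# Candidate numbers of components to try (illustrative values)
param_grid = {"tsvd__n_components": [5, 10, 20, 40]}

# Pick the value of n_components with the best cross-validated score
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(digits.data, digits.target)

print(search.best_params_)
```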

We can automate the process by creating a function that runs TSVD with
n_components set to one less than the number of original features and then
calculates the number of components that explain a desired amount of the original
data’s variance:


In [8]:
# Create and run a TSVD with one less than number of features
tsvd = TruncatedSVD(n_components=features_sparse.shape[1]-1)
# Fit on the sparse matrix (fit returns the estimator, not transformed data)
tsvd.fit(features_sparse)
# List of explained variances
tsvd_var_ratios = tsvd.explained_variance_ratio_
# Create a function
def select_n_components(var_ratio, goal_var):
    # Set initial variance explained so far
    total_variance = 0.0
    # Set initial number of components
    n_components = 0
    # For the explained variance of each component:
    for explained_variance in var_ratio:
        # Add the explained variance to the total
        total_variance += explained_variance
        # Add one to the number of components
        n_components += 1
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    # Return the number of components
    return n_components

In [10]:
select_n_components(tsvd_var_ratios, 0.95)

40
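
The same selection can also be written without an explicit loop by using a cumulative sum. This is an alternative sketch, not part of the original solution; the function name select_n_components_cumsum and the example ratios are hypothetical.

```python
import numpy as np

def select_n_components_cumsum(var_ratio, goal_var):
    # Running total of explained variance after each component
    cumulative = np.cumsum(var_ratio)
    # Index of the first running total that reaches the goal, plus one
    n_components = int(np.searchsorted(cumulative, goal_var)) + 1
    # If the goal is never reached, fall back to all components
    return min(n_components, len(cumulative))

# Hypothetical explained variance ratios (running totals: 0.5, 0.75, 0.875, 0.9375)
example_ratios = [0.5, 0.25, 0.125, 0.0625]

print(select_n_components_cumsum(example_ratios, 0.75))  # 2
print(select_n_components_cumsum(example_ratios, 0.80))  # 3
print(select_n_components_cumsum(example_ratios, 0.99))  # 4 (goal never reached)
```

As with the loop version, the function returns the total number of components when the goal level of variance is never reached.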