## Reducing Features on Sparse Data

### Problem:
**You have a sparse features matrix and want to reduce dimensionality.**

### Solution:
**Use Truncated Singular Value Decomposition (TSVD)**

In [1]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

In [2]:
# Load the data
digits = datasets.load_digits()

# Standardize the feature matrix
X = StandardScaler().fit_transform(digits.data)

# Make sparse matrix
X_sparse = csr_matrix(X)

In [3]:
# Create a TSVD
tsvd = TruncatedSVD(n_components=10)

In [4]:
# Conduct TSVD on sparse matrix
X_sparse_tsvd = tsvd.fit_transform(X_sparse)

In [5]:
# Show results
print('Original number of features:', X_sparse.shape[1])
print('Reduced number of features:', X_sparse_tsvd.shape[1])

Original number of features: 64
Reduced number of features: 10


In [6]:
# Sum of first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()

0.30039385371151334
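
To make the reduction step more concrete, here is a minimal NumPy sketch of the idea behind a truncated SVD: factor the matrix, keep only the `k` largest singular values, and use the corresponding directions as the new features. The matrix `A` and the choice `k = 2` are made-up illustration values, not part of the digits example, and the sketch uses a full dense SVD rather than whatever solver `TruncatedSVD` runs internally.

```python
import numpy as np

# Hypothetical 4x3 feature matrix (4 observations, 3 features)
A = np.array([[3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their vectors
k = 2
A_reduced = U[:, :k] * s[:k]

# Equivalent view: project A onto the top-k right singular vectors
A_projected = A @ Vt[:k].T

print(A_reduced.shape)
```

Both `A_reduced` and `A_projected` hold the same reduced features: each observation is now described by `k` values instead of the original three.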

### Selecting The Best Number Of Components For TSVD

In [17]:
# Create and fit a TSVD with one fewer component than the number of features
tsvd = TruncatedSVD(n_components=X_sparse.shape[1]-1)
tsvd.fit(X_sparse)

In [18]:
# Array of explained variance ratios, one per component
tsvd_var_ratios = tsvd.explained_variance_ratio_

In [19]:
# Return the smallest number of components whose cumulative
# explained variance ratio reaches goal_var
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of components
    n_components = 0
    
    # For the explained variance ratio of each component:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
            
    # Return the number of components
    return n_components

In [20]:
# Run function
select_n_components(tsvd_var_ratios, 0.95)

40
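
The same selection rule can be written more compactly with a running sum. The sketch below is an alternative to the loop above, not a replacement for it; the name `select_n_components_cumulative` and the example ratios are hypothetical. It uses only the standard library and accepts any iterable of ratios, including a NumPy array such as `tsvd_var_ratios`.

```python
from itertools import accumulate

def select_n_components_cumulative(var_ratio, goal_var: float) -> int:
    # Stays 0 if var_ratio is empty
    n_components = 0

    # accumulate() yields the running total of explained variance ratios
    for n_components, total in enumerate(accumulate(var_ratio), start=1):
        # Stop at the first component count that reaches the goal
        if total >= goal_var:
            break

    # If the goal is never reached, this is the total number of components
    return n_components

# Made-up ratios; the running totals are 0.5, 0.75, 0.875, 1.0
example_ratios = [0.5, 0.25, 0.125, 0.125]
n_selected = select_n_components_cumulative(example_ratios, 0.8)
print(n_selected)  # 3
```

As with the loop version, the function returns the total number of components when the goal is never reached.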

For more information on SVD and TSVD, see the [Wikipedia article on singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition).

For the scikit-learn implementation of TSVD, see the [TruncatedSVD documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html).