# Machine Learning with Python Cookbook  
# Ch 9: Dimensionality Reduction Using Feature *Extraction*

## 9.1 Reducing Features Using Principal Components
Given a set of features, reduce the number of features while retaining variance in the data.

### Use principal component analysis with scikit-learn's `PCA`:

In [1]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

In [2]:
# Load the data
digits = datasets.load_digits() 

In [3]:
# Standardize the feature matrix
features = StandardScaler().fit_transform(digits.data)

In [4]:
# Create a PCA that will retain 99% of variance
pca = PCA(n_components=0.99, whiten=True)

In [5]:
# Conduct PCA
features_pca = pca.fit_transform(features)

In [6]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_pca.shape[1]}")

Original # of features: 64
Reduced # of features: 54


PCA let us reduce our dimensionality by 10 features while still retaining 99% of the information (variance) in the feature matrix.
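
One way to see where the reduced feature count comes from is to fit a PCA that keeps every component and accumulate `explained_variance_ratio_` by hand. The sketch below assumes the same standardized digits data; the names `pca_full`, `cumulative`, `n_needed`, and `pca_99` are illustrative and do not appear in the notebook above.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same preprocessing as above
features = StandardScaler().fit_transform(datasets.load_digits().data)

# Fit a PCA that keeps every component, then accumulate the variance ratios
pca_full = PCA().fit(features)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 99%
n_needed = int(np.argmax(cumulative >= 0.99)) + 1

# PCA(n_components=0.99) should arrive at the same count
pca_99 = PCA(n_components=0.99, whiten=True)
features_pca = pca_99.fit_transform(features)
print(n_needed, pca_99.n_components_)

# With whiten=True, each retained component has unit sample variance
print(features_pca.var(axis=0, ddof=1).round(3))
```

Passing a float between 0 and 1 as `n_components` is a convenience for this cumulative-sum search; the whitening step only rescales the retained components and does not change how many are kept.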

## 9.2 Reducing Features When Data Is Linearly Inseparable

### Use an extension of principal component analysis that uses kernels to allow for non-linear dimensionality reduction:

In [7]:
# Load libraries
from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

In [8]:
# Create linearly inseparable data
features, _ = make_circles(n_samples=1000, random_state=1, 
                           noise=0.1, factor=0.1)

In [9]:
# Apply kernel PCA with a radial basis function (RBF) kernel
kpca = KernelPCA(kernel='rbf', gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

In [10]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_kpca.shape[1]}")

Original # of features: 2
Reduced # of features: 1
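
To make the link between PCA and kernel PCA concrete, the sketch below checks that kernel PCA with `kernel="linear"` gives the same projections as ordinary PCA, up to the sign of each component. The smaller sample size and the variable names are assumptions chosen for illustration, not part of the recipe above.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Smaller version of the concentric-circles data (size chosen for illustration)
features, _ = make_circles(n_samples=200, random_state=1,
                           noise=0.1, factor=0.1)

# Ordinary PCA and kernel PCA with a linear kernel
scores_pca = PCA(n_components=2).fit_transform(features)
scores_kpca = KernelPCA(kernel="linear", n_components=2).fit_transform(features)

# The projections should agree up to the sign of each component
same_up_to_sign = np.allclose(np.abs(scores_pca), np.abs(scores_kpca), atol=1e-6)
print(same_up_to_sign)
```

Seen this way, the nonlinearity in the recipe comes entirely from swapping the linear kernel for the RBF kernel; `gamma` then controls how local that kernel is.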


## 9.3 Reducing Features by Maximizing Class Separability

### Try linear discriminant analysis (LDA) to project the features onto component axes that maximize the separation of classes:

In [11]:
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [12]:
# Load Iris flower dataset
iris = datasets.load_iris()
features = iris.data
target = iris.target

In [13]:
# Create and run an LDA, then use it to transform the features
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)

In [14]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_lda.shape[1]}")

Original # of features: 4
Reduced # of features: 1


Use `explained_variance_ratio_` to view the amount of variance explained by each component.

In [15]:
lda.explained_variance_ratio_

array([0.9912126])

Set `n_components` to `None` to return the ratio of variance explained by every component feature.

In [16]:
# Create and fit an LDA that keeps every component
lda = LinearDiscriminantAnalysis(n_components=None)
features_lda = lda.fit_transform(features, target)

In [17]:
# Create array of explained variance ratios
lda_var_ratios = lda.explained_variance_ratio_

In [18]:
lda.explained_variance_ratio_

array([0.9912126, 0.0087874])
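
Only two ratios come back because LDA can produce at most `min(n_classes - 1, n_features)` components. The following sketch checks that bound on the Iris data; `n_classes` and `max_components` are illustrative names introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
features, target = iris.data, iris.target

# Upper bound on LDA components: min(n_classes - 1, n_features)
n_classes = len(np.unique(target))
max_components = min(n_classes - 1, features.shape[1])

# n_components=None keeps every available discriminant axis
lda = LinearDiscriminantAnalysis(n_components=None)
features_lda = lda.fit_transform(features, target)

print(max_components, features_lda.shape[1])
print(lda.explained_variance_ratio_.sum())
```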

In [19]:
# Create function
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of features
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    
    # Report the cumulative variance explained so far
    print(f"{round(total_variance, 3)*100}% variance explained "
          f"by {n_components} component(s).")
    # Return the number of components
    return n_components

In [20]:
# Run function
select_n_components(lda_var_ratios, 0.95)

99.1% variance explained by 1 component(s).


1
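
The same search can be written without an explicit loop. This is a minimal sketch using NumPy's cumulative sum; `select_n_components_cumsum` is a hypothetical helper, and the ratios are copied from the LDA output above.

```python
import numpy as np

def select_n_components_cumsum(var_ratio, goal_var: float) -> int:
    """Return the fewest leading components whose ratios sum to >= goal_var."""
    cumulative = np.cumsum(var_ratio)
    # Indices of the cumulative values that meet the goal
    reached = np.flatnonzero(cumulative >= goal_var)
    # If the goal is never reached, fall back to using every component
    return int(reached[0]) + 1 if reached.size else len(cumulative)

# Ratios reported by the LDA above
lda_var_ratios = np.array([0.9912126, 0.0087874])
print(select_n_components_cumsum(lda_var_ratios, 0.95))
```

Like the loop version, it returns the total number of components when the goal is never reached.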

## 9.4 Reducing Features Using Matrix Factorization
You have a feature matrix of non-negative values and want to reduce its dimensionality.

### Use non-negative matrix factorization (NMF) to reduce the dimensionality of the feature matrix:

In [21]:
# Load libraries
from sklearn.decomposition import NMF
from sklearn import datasets

In [22]:
# Load the data
digits = datasets.load_digits()

In [23]:
# Load feature matrix
features = digits.data

In [24]:
# Create, fit, and apply NMF
nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)



In [25]:
# Show results
print(f"Original # of features: {features.shape[1]}")
print(f"Reduced # of features: {features_nmf.shape[1]}")

Original # of features: 64
Reduced # of features: 10
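
NMF factors the feature matrix into two non-negative matrices whose product approximates the original. The sketch below inspects both factors; the names `W` and `H` follow common notation, and `max_iter=500` is an assumption made here to give the solver more room to converge, not a setting from the recipe.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import NMF

features = datasets.load_digits().data

# W holds the reduced observations; H (components_) maps them back to pixels
nmf = NMF(n_components=10, random_state=1, max_iter=500)
W = nmf.fit_transform(features)
H = nmf.components_

# Both factors are non-negative, and W @ H approximates the original matrix
approx = W @ H
frobenius_error = np.linalg.norm(features - approx)
print(W.shape, H.shape)
print(f"Reconstruction error (Frobenius norm): {frobenius_error:.2f}")
```

Unlike PCA, NMF does not report explained variance, so the reconstruction error across several values of `n_components` is one practical way to compare candidates.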


## 9.5 Reducing Features on Sparse Data
You have a sparse feature matrix and want to reduce its dimensionality.

### Use Truncated Singular Value Decomposition (TSVD):

In [29]:
# Load libraries
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix
from sklearn import datasets
import numpy as np

In [31]:
# Load the data
digits = datasets.load_digits()

In [32]:
# Standardize feature matrix
features = StandardScaler().fit_transform(digits.data)

In [33]:
# Make sparse matrix
features_sparse = csr_matrix(features)

In [34]:
# Create a TSVD
tsvd = TruncatedSVD(n_components=10)

In [35]:
# Conduct TSVD on sparse matrix
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)

In [36]:
# Show results
print(f"Original # of features: {features_sparse.shape[1]}")
print(f"Reduced # of features: {features_sparse_tsvd.shape[1]}")

Original # of features: 64
Reduced # of features: 10


#### What is the optimum number of components?

In [37]:
# Sum of first three components' explained variance ratios
tsvd.explained_variance_ratio_[0:3].sum()

0.3003938539183401

In [38]:
features_sparse.shape

(1797, 64)

In [39]:
# Create and fit a TSVD with one fewer component than the number of features
tsvd = TruncatedSVD(n_components=features_sparse.shape[1]-1)
tsvd.fit(features_sparse)

In [41]:
# List of explained variances
tsvd_var_ratios = tsvd.explained_variance_ratio_
tsvd_var_ratios

array([1.20339161e-01, 9.56105440e-02, 8.44441489e-02, 6.49840791e-02,
       4.86015488e-02, 4.21411987e-02, 3.94208280e-02, 3.38938092e-02,
       2.99822101e-02, 2.93200255e-02, 2.78180546e-02, 2.57705509e-02,
       2.27530332e-02, 2.22717974e-02, 2.16522943e-02, 1.91416661e-02,
       1.77554709e-02, 1.63806927e-02, 1.59646017e-02, 1.48919119e-02,
       1.34796957e-02, 1.27193137e-02, 1.16583735e-02, 1.05764660e-02,
       9.75315947e-03, 9.44558990e-03, 8.63013827e-03, 8.36642854e-03,
       7.97693248e-03, 7.46471371e-03, 7.25582151e-03, 6.91911245e-03,
       6.53908536e-03, 6.40792574e-03, 5.91384112e-03, 5.71162405e-03,
       5.23636803e-03, 4.81807586e-03, 4.53719260e-03, 4.23162753e-03,
       4.06053070e-03, 3.97084808e-03, 3.56493303e-03, 3.40787181e-03,
       3.27835335e-03, 3.11032007e-03, 2.88575294e-03, 2.76489264e-03,
       2.59174941e-03, 2.34483006e-03, 2.18256858e-03, 2.03597635e-03,
       1.95512426e-03, 1.83318499e-03, 1.67946387e-03, 1.61236062e-03,
       ...])

In [42]:
# Create a function
def select_n_components(var_ratio, goal_var):
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of features
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break

    print(f"{round(total_variance, 3)*100}% variance explained "
          f"by {n_components} component(s).")
    # Return the number of components
    return n_components

In [43]:
# Run function
select_n_components(tsvd_var_ratios, 0.95)

95.1% variance explained by 40 component(s).


40

In [44]:
select_n_components(tsvd_var_ratios, 0.99)

99.1% variance explained by 54 component(s).


54

In [45]:
select_n_components(tsvd_var_ratios, 0.997)

99.8% variance explained by 59 component(s).


59