You want to reduce the number of features to be used by a classifier.

Try linear discriminant analysis (LDA) to project the features onto component
axes that maximize the separation of classes:



In [7]:
# Load libraries
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Load Iris flower dataset:
iris = datasets.load_iris()
features = iris.data
target = iris.target
# Create and run an LDA, then use it to transform the features
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)
# Print the number of features
print("Original number of features:", features.shape[1])
print("Reduced number of features:", features_lda.shape[1])

Original number of features: 4
Reduced number of features: 1
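
Since the stated goal is to feed fewer features to a classifier, the following is a minimal sketch of how the LDA step might be chained with a downstream model. The choice of LogisticRegression, the five-fold cross-validation, and the name pipeline are assumptions made for illustration; they are not part of the original solution.

```python
# Sketch only: the classifier and the cross-validation setup are assumptions.
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Load Iris flower dataset
iris = datasets.load_iris()
features, target = iris.data, iris.target

# Reduce to one LDA component, then classify on that single feature
pipeline = make_pipeline(
    LinearDiscriminantAnalysis(n_components=1),
    LogisticRegression(),
)

# Fitting LDA inside the pipeline keeps each fold's test data out of the fit
scores = cross_val_score(pipeline, features, target, cv=5)
print("Mean cross-validated accuracy:", scores.mean())
```

Placing the LDA step inside the pipeline means the projection is re-learned on each training fold rather than on the full dataset.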


We can use explained_variance_ratio_ to view the amount of variance
explained by each component. In our solution, the single component explained
over 99% of the variance:

In [8]:
lda.explained_variance_ratio_

array([0.9912126])

LDA is a classification technique that is also a popular technique for dimensionality
reduction. LDA works similarly to principal component analysis (PCA) in that it
projects our feature space onto a lower-dimensional space. However, in PCA we
were only interested in the component axes that maximize the variance in the
data, while in LDA we have the additional goal of maximizing the differences
between classes. In this pictured example, we have data comprising two target
classes and two features. If we project the data onto the y-axis, the two classes
are not easily separable (i.e., they overlap), while if we project the data onto the
x-axis, we are left with a feature vector (i.e., we reduced our dimensionality by
one) that still preserves class separability. In the real world, of course, the
relationship between the classes will be more complex and the dimensionality
will be higher, but the concept remains the same.

![](./pics/maxmizing%20class%20seperability.jpg)
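
To make the contrast with PCA concrete, here is a small sketch on synthetic data. The data, the random seed, and the variable names are all assumptions chosen for illustration: two classes are separated along the x-axis, while most of the overall variance lies along the y-axis. Under these assumptions, PCA's first component should follow the y-axis (highest variance), whereas LDA's discriminant direction should follow the x-axis (best class separation).

```python
# Sketch only: synthetic data constructed to mimic the two-feature example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two classes separated along x (means -3 and +3) with a wide spread along y
class_0 = rng.normal(loc=[-3.0, 0.0], scale=[1.0, 5.0], size=(200, 2))
class_1 = rng.normal(loc=[3.0, 0.0], scale=[1.0, 5.0], size=(200, 2))
X = np.vstack([class_0, class_1])
y = np.array([0] * 200 + [1] * 200)

# PCA ignores the labels and finds the direction of maximum variance
pca_direction = PCA(n_components=1).fit(X).components_[0]

# LDA uses the labels and finds the direction that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_direction = lda.scalings_[:, 0]
lda_direction = lda_direction / np.linalg.norm(lda_direction)

print("PCA direction (x, y):", pca_direction)
print("LDA direction (x, y):", lda_direction)
```

The sign of each direction vector is arbitrary; only the axis it aligns with matters here.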

In scikit-learn, LDA is implemented using LinearDiscriminantAnalysis,
which includes a parameter, n_components, indicating the number of features
we want returned. To figure out what argument value to use with n_components
(i.e., how many components to keep), we can take advantage of the fact that
explained_variance_ratio_ tells us the variance explained by each outputted
feature and is a sorted array. For example:

In [9]:
lda.explained_variance_ratio_

array([0.9912126])

Specifically, we can run LinearDiscriminantAnalysis with n_components set
to None to return the ratio of variance explained by every component feature,
then calculate how many components are required to get above some threshold
of variance explained (often 0.95 or 0.99):

In [11]:
# Create and fit LDA, keeping every component
lda = LinearDiscriminantAnalysis(n_components=None)
lda.fit(features, target)
# Create array of explained variance ratios
lda_var_ratios = lda.explained_variance_ratio_
# Create function
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    # Set initial number of features
    n_components = 0
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        # Add the explained variance to the total
        total_variance += explained_variance
        # Add one to the number of components
        n_components += 1
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
    # Return the number of components
    return n_components
# Run function
select_n_components(lda_var_ratios, 0.95)

1
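
The same threshold search can also be written without an explicit loop. The following is one possible sketch using NumPy's cumsum and searchsorted; the function name select_n_components_cumsum is hypothetical, and the sketch assumes the ratios are non-negative so that their cumulative sum is non-decreasing.

```python
# Sketch only: a vectorized variant of the loop-based function above.
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def select_n_components_cumsum(var_ratio, goal_var: float) -> int:
    # Running total of variance explained after each component
    cumulative = np.cumsum(var_ratio)
    # Index of the first running total that is >= goal_var
    first_index = int(np.searchsorted(cumulative, goal_var, side="left"))
    # Convert index to a count; cap at the number of available components
    return min(first_index + 1, len(cumulative))


# Fit LDA on the Iris data, keeping every component
iris = datasets.load_iris()
lda_all = LinearDiscriminantAnalysis(n_components=None)
lda_all.fit(iris.data, iris.target)

n_needed = select_n_components_cumsum(lda_all.explained_variance_ratio_, 0.95)
print("Components needed:", n_needed)
```

Like the loop version, this returns the total number of components when the goal is never reached.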