The Manifold visualizer provides high dimensional visualization using
manifold learning to embed instances described by many dimensions into 2,
thus allowing the creation of a scatter plot that shows latent structures
in data. Unlike decomposition methods such as PCA and SVD, manifolds
generally use nearest-neighbor approaches to embedding, allowing them to
capture non-linear structures that would otherwise be lost. The projections
that are produced can then be analyzed for noise or separability to
determine whether it is possible to create a decision space in the data.
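The embedding itself is performed by scikit-learn's manifold estimators. As a minimal sketch of the underlying idea (using sklearn.manifold.Isomap directly rather than the visualizer; the digits dataset and the subsample size are illustrative choices), 64-dimensional instances can be reduced to 2 dimensions suitable for a scatter plot:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# 64-dimensional instances (8x8 digit images), subsampled to keep this quick
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

# Embed into 2 dimensions; each instance becomes a single (x, y) point
embedding = Isomap(n_components=2).fit_transform(X)
print(embedding.shape)  # (300, 2) -- one 2D point per instance
```

Plotting `embedding[:, 0]` against `embedding[:, 1]`, colored by `y`, yields the kind of scatter plot the visualizer draws for you.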
The Manifold visualizer allows access to all currently available
scikit-learn manifold implementations by specifying the manifold as a
string to the visualizer. The currently implemented default manifolds are
as follows:
- Locally Linear Embedding (LLE): uses many local linear decompositions to preserve globally non-linear structures.
- LTSA LLE: local tangent space alignment is similar to LLE in that it uses locality to preserve neighborhood distances.
- Hessian LLE: an LLE regularization method that applies a hessian-based quadratic form at each neighborhood.
- Modified LLE: applies a regularization parameter to LLE.
- Isomap: seeks a lower dimensional embedding that maintains geometric distances between each instance.
- MDS: multi-dimensional scaling uses similarity to plot points that are near to each other close in the embedding.
- Spectral Embedding: a discrete approximation of the low dimensional manifold using a graph representation.
- t-SNE: converts the similarity of points into probabilities, then uses those probabilities to create an embedding.
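Each of these corresponds to an estimator in scikit-learn's sklearn.manifold module. As a rough sketch of that correspondence (the dictionary keys here are illustrative aliases, not necessarily the exact strings the visualizer accepts; check the API reference below):

```python
from sklearn.manifold import (
    Isomap, LocallyLinearEmbedding, MDS, SpectralEmbedding, TSNE
)

# Illustrative mapping from manifold names to 2D scikit-learn estimators.
# The four LLE variants are all LocallyLinearEmbedding with different methods.
manifolds = {
    "lle": LocallyLinearEmbedding(n_components=2),
    "ltsa": LocallyLinearEmbedding(n_components=2, method="ltsa"),
    "hessian": LocallyLinearEmbedding(n_components=2, method="hessian",
                                      n_neighbors=10),
    "modified": LocallyLinearEmbedding(n_components=2, method="modified"),
    "isomap": Isomap(n_components=2),
    "mds": MDS(n_components=2),
    "spectral": SpectralEmbedding(n_components=2),
    "tsne": TSNE(n_components=2),
}

# Every estimator exposes fit_transform(X), returning an (n_samples, 2) array.
```

Note that Hessian LLE requires more neighbors than the default (it needs n_neighbors > n_components * (n_components + 3) / 2), hence the explicit n_neighbors above.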
Each manifold algorithm produces a different embedding and takes advantage of different properties of the underlying data. Generally speaking, it requires multiple attempts on new data to determine the manifold that works best for the structures latent in your data. Note however, that different manifold algorithms have different time, complexity, and resource requirements.
Manifolds can be used on many types of problems, and the color used in the scatter plot can describe the target instance. In an unsupervised or clustering problem, a single color is used to show structure and overlap. In a classification problem discrete colors are used for each class. In a regression problem, a color map can be used to describe points as a heat map of their regression values.
In a classification or clustering problem, the instances can be described by discrete labels - the classes or categories in the supervised problem, or the clusters they belong to in the unsupervised version. The manifold visualizes this by assigning a color to each label and showing the labels in a legend.
# Load the classification dataset
data = load_data('occupancy')

# Specify the features of interest
features = [
    "temperature", "relative humidity", "light", "C02", "humidity"
]

# Extract the instances and target
X = data[features]
y = data.occupancy
from yellowbrick.features.manifold import Manifold

visualizer = Manifold(manifold='tsne', target='discrete')
visualizer.fit_transform(X, y)
visualizer.poof()
The visualization also displays the amount of time it takes to generate the
embedding; as you can see, this can take a long time even for relatively
small datasets. One tip is to scale your data using the StandardScaler;
another is to sample your instances (e.g. using train_test_split to
preserve class stratification) or to filter features to decrease sparsity
in your dataset.
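For instance, scaling and stratified downsampling before fitting the visualizer might look like the following sketch (a synthetic dataset stands in for real data, and the sample size of 500 is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a larger classification dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Scale features to zero mean and unit variance before embedding
X_scaled = StandardScaler().fit_transform(X)

# Downsample to 500 instances while preserving class stratification
X_sample, _, y_sample, _ = train_test_split(
    X_scaled, y, train_size=500, stratify=y, random_state=42
)
# X_sample is now a smaller, scaled matrix to pass to the visualizer
```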
One common mechanism is to use SelectKBest to select the features that have
a statistical correlation with the target. For example, we can use the
f_classif score to find the 3 best features in our occupancy dataset.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

model = Pipeline([
    ("selectk", SelectKBest(k=3, score_func=f_classif)),
    ("viz", Manifold(manifold='isomap', target='discrete')),
])

# Load the classification dataset
data = load_data("occupancy")

# Specify the features of interest
features = [
    "temperature", "relative humidity", "light", "CO2", "humidity"
]

# Extract the instances and target
X = data[features]
y = data.occupancy

model.fit(X, y)
model.named_steps['viz'].poof()
For a regression target, or to specify color as a heat map of continuous
values, use target='continuous'. Note that by default the param
target='auto' is set, which determines whether the target is discrete or
continuous by counting the number of unique values in y.
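The idea behind that heuristic can be sketched as follows (an illustrative function, not Yellowbrick's actual implementation; the threshold of 10 unique values is an assumption):

```python
import numpy as np

def guess_target_type(y, max_discrete=10):
    # Treat targets with few unique values as discrete class labels,
    # otherwise as a continuous quantity (hypothetical threshold).
    return "discrete" if len(np.unique(y)) <= max_discrete else "continuous"

print(guess_target_type([0, 1, 0, 1]))           # discrete
print(guess_target_type(np.linspace(0, 1, 50)))  # continuous
```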
# Load the regression dataset
data = load_data('concrete')

# Specify the features of interest
feature_names = [
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]
target_name = 'strength'

# Get the X and y data from the DataFrame
X = data[feature_names]
y = data[target_name]
visualizer = Manifold(manifold='isomap', target='continuous')
visualizer.fit_transform(X, y)
visualizer.poof()
.. automodule:: yellowbrick.features.manifold
    :members: Manifold
    :undoc-members:
    :show-inheritance: