# A Manifold Learning Application

## Objectives

- Apply various manifold learning techniques to the Iris dataset to visualize the data in two dimensions.
- Evaluate how different techniques handle the complexity and characteristics of the dataset.
- Explore the efficacy of these methods in separating different species based on floral measurements.

## Background

Manifold learning techniques simplify high-dimensional data into lower dimensions while attempting to preserve its intrinsic structure, making it easier to analyze and visualize.

## Datasets Used

The Iris dataset is a classic dataset in machine learning. It consists of 150 samples of iris flowers from three different species. Each sample has four features: sepal length, sepal width, petal length, and petal width.

## Iris Dataset

In [1]:
import numpy as np
import pandas as pd

import plotly.express as px 
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, LocallyLinearEmbedding, Isomap, TSNE

In [2]:
iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
print(df.shape)
df.head()

(150, 5)


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## Preprocessing Data

In [3]:
# Missing values analysis
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64

## Visualizing Data

In [4]:
px.bar(
    x=df.species.value_counts().index,
    y=df.species.value_counts(),
    color=df.species.value_counts().index,
    width=700,
    height=400,
    title='Iris Dataset'
)

There are 50 cases of each species.

In [5]:
df_melt = pd.melt(df, id_vars=['species'])
print(df_melt.shape)
df_melt.head()

(600, 3)


| | species | variable | value |
|---|---|---|---|
| 0 | setosa | sepal length (cm) | 5.1 |
| 1 | setosa | sepal length (cm) | 4.9 |
| 2 | setosa | sepal length (cm) | 4.7 |
| 3 | setosa | sepal length (cm) | 4.6 |
| 4 | setosa | sepal length (cm) | 5.0 |


In [6]:
# Create a grouped box plot
fig = px.box(
    df_melt,
    x='variable',
    y='value',
    color='species',
    width=800,
    height=400,
    title='Box Plot of Iris Features'
)
# Change the legend position to 'top'
fig.update_layout(legend=dict(x=0.4, y=1.2, orientation='h'))
# Show the plot
fig.show()

The original data is four-dimensional (sepal length, sepal width, petal length, and petal width) and covers three species: setosa, versicolor, and virginica.

## Standardizing Data

In [7]:
# Create a box plot with the original data
fig_o = px.box(
    df_melt,
    x='variable',
    y='value',
    width=700,
    height=400,
    title='Box Plot of Original Iris Features'
)
# Change the legend position to 'top'
fig_o.update_layout(legend=dict(x=0.4, y=1.2, orientation='h'))
# Show the plot
fig_o.show()

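As the box plot shows, the four variables have different scales, so we standardize them before applying any manifold learning technique. Otherwise, the features with the largest ranges would dominate the distance computations these methods rely on.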
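To make the standardization step concrete, here is a minimal sketch showing that `StandardScaler` is equivalent to a per-column z-score computed with the population standard deviation (`ddof=0`). The variable names `X_manual` and `X_scaled` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data  # shape (150, 4), raw measurements in cm

# Manual z-score: subtract each column's mean, divide by its population std (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# StandardScaler applies the same transformation column by column
X_scaled = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After scaling, every feature has mean 0 and unit variance, so each one contributes on a comparable footing to the pairwise distances used later.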

In [8]:
scaler = StandardScaler()

In [9]:
dfS = pd.DataFrame(
    scaler.fit_transform(
        df[['sepal length (cm)', 'sepal width (cm)',
            'petal length (cm)', 'petal width (cm)']]
    ),
    columns=df.columns[:-1]
)

dfS['species'] = df['species']

dfS.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 | setosa |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 | setosa |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 | setosa |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 | setosa |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 | setosa |


In [10]:
df_meltS = pd.melt(dfS, id_vars=['species'])
print(df_meltS.shape)
df_meltS.head()

(600, 3)


| | species | variable | value |
|---|---|---|---|
| 0 | setosa | sepal length (cm) | -0.900681 |
| 1 | setosa | sepal length (cm) | -1.143017 |
| 2 | setosa | sepal length (cm) | -1.385353 |
| 3 | setosa | sepal length (cm) | -1.506521 |
| 4 | setosa | sepal length (cm) | -1.021849 |


In [11]:
# Create a box plot with the standardized data
fig_s = px.box(
    df_meltS,
    x='variable',
    y='value',
    width=700,
    height=400,
    title='Box Plot of Standardized Iris Features'
)
# Change the legend position to 'top'
fig_s.update_layout(legend=dict(x=0.4, y=1.2, orientation='h'))
# Show the plot
fig_s.show()

## Principal Component Analysis

In [12]:
pca = PCA(n_components=2)
pca.fit(dfS.iloc[:, :-1])

print('Explained Variance Ratio =', pca.explained_variance_ratio_.round(2))

Explained Variance Ratio = [0.73 0.23]


The explained variance ratio of [0.73, 0.23] indicates that the first principal component accounts for 73% of the variance in the data, while the second accounts for 23%. 

Together, the first two principal components capture 96% (73% + 23%) of the total variability in the original data, leaving only 4% of the variance unexplained by these two dimensions.
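The 96% figure can be checked by fitting PCA with all four components and accumulating the ratios. This is a small sketch that rebuilds the standardized matrix from scratch; `pca_full` and `cumulative` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# Keep every component so the ratios sum to 1
pca_full = PCA()
pca_full.fit(X_std)

ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

print('Per component:', ratios.round(2))
print('Cumulative:   ', cumulative.round(2))
```

The second entry of `cumulative` is the share of variance retained by the two-dimensional projection; the last entry is 1 by construction.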

In [13]:
df_pca = pd.DataFrame(pca.fit_transform(dfS.iloc[:, :-1]), columns=['PCA_1', 'PCA_2'])
df_pca['species'] = df['species']
df_pca.head()

| | PCA_1 | PCA_2 | species |
|---|---|---|---|
| 0 | -2.264703 | 0.480027 | setosa |
| 1 | -2.080961 | -0.674134 | setosa |
| 2 | -2.364229 | -0.341908 | setosa |
| 3 | -2.299384 | -0.597395 | setosa |
| 4 | -2.389842 | 0.646835 | setosa |


In [14]:
# Plotting the data in the PCA space
fig_pca = px.scatter(
    df_pca,
    x='PCA_1',
    y='PCA_2',
    color='species',
    width=700,
    height=400,
    title='PCA of Iris Dataset'
)
fig_pca.show()

PCA transforms the original data into a new coordinate system with the axes (principal components) ordered by the variance they capture from the data. The first principal component (PCA_1 on the x-axis) captures the most variance, and the second principal component (PCA_2 on the y-axis) captures the second most.
- The setosa species (blue) is clearly separated from the other two species along the first principal component axis. That indicates setosa has distinct petal and sepal characteristics that PCA_1 captures.
- The versicolor (red) and virginica (green) species are more mixed but somewhat separated along both principal component axes. 
- A few points are far from the main clusters of their respective species, especially in the virginica species. These could be considered outliers.
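To see which measurements drive each principal component, one option is to inspect the loadings stored in `pca.components_`. The sketch below refits PCA on a freshly standardized copy of the data; the `loadings` table is an illustrative construct, not something the notebook builds.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_std = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2).fit(X_std)

# components_ has shape (n_components, n_features); transpose for a feature-by-component table
loadings = pd.DataFrame(
    pca.components_.T,
    index=iris.feature_names,
    columns=['PCA_1', 'PCA_2']
)
print(loadings.round(2))
```

Features with large absolute loadings on PCA_1 are the ones that contribute most to the horizontal separation in the scatter plot. The sign of a component is arbitrary, so only relative magnitudes and sign patterns within a column are meaningful.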

## Multidimensional Scaling

In [15]:
mds = MDS(
    n_components=2,
    random_state=25,
    normalized_stress='auto'
)
df_mds = pd.DataFrame(
    mds.fit_transform(
        dfS.iloc[:, :-1]),
    columns=['MDS_1', 'MDS_2']
)
df_mds['species'] = df['species']

df_mds.head()

| | MDS_1 | MDS_2 | species |
|---|---|---|---|
| 0 | 2.32853 | 0.063686 | setosa |
| 1 | 1.929357 | -1.062158 | setosa |
| 2 | 2.276073 | -0.771324 | setosa |
| 3 | 2.169062 | -1.018083 | setosa |
| 4 | 2.485198 | 0.19964 | setosa |


In [16]:
# Plotting the data in the MDS space
fig_mds = px.scatter(
    df_mds,
    x='MDS_1',
    y='MDS_2',
    color='species',
    width=700,
    height=400,
    title='MDS of Iris Dataset'
)
fig_mds.show()

MDS focuses on preserving the distances between pairs of objects.
- Similar to the PCA plot, the setosa species (blue) is separated from the other two, indicating unique characteristics that differ from versicolor and virginica species.
- The versicolor (red) and virginica (green) species are more intermixed, spread across the center of the plot, suggesting a higher degree of similarity between these two species.
- There is some overlap between versicolor and virginica, which indicates that while they have their unique characteristics, they are not as distinctly separated as setosa is from both.
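One rough way to check how well MDS preserves pairwise distances is to correlate the distances in the original standardized space with those in the 2-D embedding. This is only a sketch: the correlation is an informal diagnostic chosen here for illustration, not the stress criterion that MDS optimizes, and the `normalized_stress` argument is left at its default, which may trigger a `FutureWarning` on some scikit-learn versions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_iris
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

embedding = MDS(n_components=2, random_state=25).fit_transform(X_std)

# Condensed vectors of all pairwise Euclidean distances
d_original = pdist(X_std)
d_embedded = pdist(embedding)

corr = np.corrcoef(d_original, d_embedded)[0, 1]
print(f'Correlation between original and embedded distances: {corr:.3f}')
```

A value close to 1 suggests the 2-D layout is a faithful picture of the pairwise distances, which is consistent with the MDS and PCA plots looking alike.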

## Locally Linear Embedding

In [17]:
lle = LocallyLinearEmbedding(
    n_neighbors=50,
    n_components=2,
    random_state=50,
    method='modified',
    eigen_solver='dense'
)

df_lle = pd.DataFrame(
    lle.fit_transform(dfS.iloc[:, :-1]),
    columns=['LLE_1', 'LLE_2']
)

df_lle['species'] = df['species']

df_lle.head()

| | LLE_1 | LLE_2 | species |
|---|---|---|---|
| 0 | 0.013588 | -0.117717 | setosa |
| 1 | -0.082846 | -0.069888 | setosa |
| 2 | -0.056038 | -0.093083 | setosa |
| 3 | -0.077105 | -0.081928 | setosa |
| 4 | 0.026443 | -0.128816 | setosa |


In [18]:
# Plotting the data in the LLE space
fig_lle = px.scatter(
    df_lle,
    x='LLE_1',
    y='LLE_2',
    color='species',
    width=700,
    height=400,
    title='LLE of Iris Dataset'
)
fig_lle.show()

The graph illustrates the Iris dataset's Locally Linear Embedding (LLE) transformation, a non-linear dimensionality reduction technique. 
- The setosa species (blue) is clearly separated from the other two species, indicating distinct local geometric features.
- The versicolor (red) and virginica (green) species somewhat overlap but also show some separation along both LLE axes.
- The distribution of points suggests that the two-dimensional embedding preserves local neighborhood relationships. However, versicolor and virginica are not cleanly separated.

## Isometric Mapping

In [19]:
iso = Isomap(
    n_neighbors=50, 
    n_components=2, 
    eigen_solver='dense'
)

df_iso = pd.DataFrame(
    iso.fit_transform(dfS.iloc[:, :-1]), 
    columns=['ISO_1', 'ISO_2']
)
df_iso['species'] = df['species']
df_iso.head()

| | ISO_1 | ISO_2 | species |
|---|---|---|---|
| 0 | -2.525301 | 0.330084 | setosa |
| 1 | -2.189223 | -0.702436 | setosa |
| 2 | -2.515714 | -0.450076 | setosa |
| 3 | -2.39646 | -0.636151 | setosa |
| 4 | -2.650858 | 0.562144 | setosa |


In [20]:
# Plotting the data in the Isomap space
fig_iso = px.scatter(
    df_iso,
    x='ISO_1',
    y='ISO_2',
    color='species',
    width=700,
    height=400,
    title='Isomap of Iris Dataset'
)
fig_iso.show()

Isomap is a manifold learning technique that reduces dimensionality by attempting to preserve the geodesic distances between all points. 
- The setosa species (blue) is distinctly clustered and separated from the other two species along the ISO_1 axis, indicating that its features are significantly different in the multidimensional space of the original features.
- The versicolor (red) and virginica (green) species show some overlap. 
- The relative positions of the species suggest that Isomap has captured meaningful global relationships within the data, providing a pretty good visualization of the dataset's structure in reduced dimensions.
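The geodesic distances that Isomap tries to preserve are shortest-path lengths through the nearest-neighbor graph, so they can never be shorter than the straight-line distance between the same two points. A fitted `Isomap` exposes them through its `dist_matrix_` attribute. The following sketch compares them with plain Euclidean distances; the `ratio` summary is an ad hoc measure introduced here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris
from sklearn.manifold import Isomap
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

iso = Isomap(n_neighbors=50, n_components=2).fit(X_std)

geodesic = iso.dist_matrix_            # graph shortest-path distances, shape (150, 150)
euclidean = squareform(pdist(X_std))   # straight-line distances, same shape

# Small tolerance for floating-point differences between the two computations
print(np.all(geodesic >= euclidean - 1e-6))

# Ad hoc summary: how much longer graph paths are than straight lines overall
ratio = geodesic.sum() / euclidean.sum()
print(f'Total geodesic / total Euclidean distance: {ratio:.3f}')
```

With a neighborhood as large as 50, most pairs are connected by short paths, so a ratio near 1 would mean Isomap behaves much like a linear method on this data.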

## t-Distributed Stochastic Neighbor Embedding

In [21]:
tsne = TSNE(random_state=20)

df_tsne = pd.DataFrame(
    tsne.fit_transform(dfS.iloc[:, :-1]), 
    columns=['tSNE_1', 'tSNE_2']
)
df_tsne['species'] = df['species']
df_tsne.head()

| | tSNE_1 | tSNE_2 | species |
|---|---|---|---|
| 0 | -26.283514 | -0.93404 | setosa |
| 1 | -22.458551 | -1.582246 | setosa |
| 2 | -23.502516 | -0.568364 | setosa |
| 3 | -22.724354 | -0.502646 | setosa |
| 4 | -26.786913 | -0.364475 | setosa |


In [22]:
# Plotting the data in the tSNE space
fig_tsne = px.scatter(
    df_tsne,
    x='tSNE_1',
    y='tSNE_2',
    color='species',
    width=700,
    height=400,
    title='tSNE of Iris Dataset'
)
fig_tsne.show()

The graph depicts the results of t-SNE applied to the Iris dataset.
- The setosa species (blue) is clearly separated from the other two species, indicating a distinct structure in the multidimensional feature space.
- The versicolor (red) and virginica (green) species are closer together but still form two discernible clusters. This shows that t-SNE can separate species to some extent, even when their feature spaces are somewhat similar.
- t-SNE is particularly good at creating a map that reveals structures within the data, such as clusters of similar data points, which are quite distinct for the setosa species.
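The t-SNE run above relies on scikit-learn's default `perplexity` of 30, and the resulting map can change noticeably with that setting. The sketch below refits t-SNE for a few perplexity values; the values 5, 30, and 50 are arbitrary choices for illustration, and the `results` dictionary is a hypothetical container for later plotting.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

results = {}
for perplexity in (5, 30, 50):
    # perplexity must be smaller than the number of samples (150 here)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=20)
    embedding = tsne.fit_transform(X_std)
    results[perplexity] = (embedding, float(tsne.kl_divergence_))
    print(perplexity, embedding.shape, round(float(tsne.kl_divergence_), 3))
```

Each embedding in `results` can be passed to `px.scatter` in the same way as `df_tsne`. The KL divergence values are reported for reference only, since they are not directly comparable across different perplexities.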

### Detecting a possible outlier

All the manifold learning techniques separate the setosa species from the other two, indicating that its features differ significantly from theirs.

The difficult task is separating the versicolor and virginica species. t-SNE offers the best separation, but a close look at the plot reveals a green point among the red points. It could be an outlier. Let's identify it, remove it, and recompute t-SNE without it.

In [23]:
# Approximate coordinates of the suspected outlier, read off the t-SNE plot
outlier_tsne_coords = np.array([1.5, 0.83])
outlier_tsne_coords

array([1.5 , 0.83])

In [24]:
# Plotting the data in the tSNE space
fig_tsne_outl = px.scatter(
    df_tsne,
    x='tSNE_1',
    y='tSNE_2',
    color='species',
    width=700,
    height=400,
    title='tSNE of Iris Dataset with Outlier'
)
# Add a circle shape to highlight the point
fig_tsne_outl.add_shape(
    type="circle",          # unfilled Circle
    xref="x",
    yref="y",
    # Adjust these values to change the size of the circle
    x0=outlier_tsne_coords[0] - 1,
    y0=outlier_tsne_coords[1] - 0.3,
    x1=outlier_tsne_coords[0] + 1,
    y1=outlier_tsne_coords[1] + 0.3,
    line=dict(color="Black", width=2)
)
fig_tsne_outl.show()

We know that the possible outlier belongs to the virginica species (green).

In [25]:
# Filter the DataFrame for only virginica species
virginica_tsne = df_tsne[df_tsne['species'] == 'virginica'].copy()
virginica_tsne.head()

| | tSNE_1 | tSNE_2 | species |
|---|---|---|---|
| 100 | 13.583465 | -1.031314 | virginica |
| 101 | 7.881903 | -1.946566 | virginica |
| 102 | 13.398383 | 1.66505 | virginica |
| 103 | 9.875397 | -0.209483 | virginica |
| 104 | 11.905957 | -0.291403 | virginica |


t-SNE is a stochastic algorithm and provides no inverse transform, so coordinates read off the plot cannot be mapped directly back to the original feature space. However, each row of the embedding corresponds to the same row of the input data, so once we know which embedded point we are looking at, its index identifies the original sample.

To find that point, we compute the distance from every virginica point to the approximate coordinates we identified visually. The point with the smallest distance is the possible outlier in the t-SNE space.

We then use its index to look up the potential outlier's original feature values.

In [26]:
# Calculate the Euclidean distance of all virginica points from the outlier's t-SNE coordinates
virginica_tsne['distance_to_outlier'] = np.sqrt(
    ((virginica_tsne[['tSNE_1', 'tSNE_2']] - outlier_tsne_coords) ** 2).sum(axis=1)
)
virginica_tsne.head()

| | tSNE_1 | tSNE_2 | species | distance_to_outlier |
|---|---|---|---|---|
| 100 | 13.583465 | -1.031314 | virginica | 12.225981 |
| 101 | 7.881903 | -1.946566 | virginica | 6.959742 |
| 102 | 13.398383 | 1.66505 | virginica | 11.92765 |
| 103 | 9.875397 | -0.209483 | virginica | 8.439656 |
| 104 | 11.905957 | -0.291403 | virginica | 10.466207 |


In [27]:
# Find the index of the virginica point with the minimum distance to the outlier's t-SNE coordinates
outlier_index = virginica_tsne['distance_to_outlier'].idxmin()
outlier_index

106

### Recomputing tSNE

Let's remove the outlier and recompute tSNE.

In [28]:
# Drop the outlier from the original DataFrame
dfS_clean = dfS.drop(outlier_index)
print(dfS_clean.shape)
dfS_clean.head()

(149, 5)


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | -0.900681 | 1.019004 | -1.340227 | -1.315444 | setosa |
| 1 | -1.143017 | -0.131979 | -1.340227 | -1.315444 | setosa |
| 2 | -1.385353 | 0.328414 | -1.397064 | -1.315444 | setosa |
| 3 | -1.506521 | 0.098217 | -1.283389 | -1.315444 | setosa |
| 4 | -1.021849 | 1.249201 | -1.340227 | -1.315444 | setosa |


In [29]:
df_tsne_clean = pd.DataFrame(
    tsne.fit_transform(dfS_clean.iloc[:, :-1]), 
    columns=['tSNE_1', 'tSNE_2']
)
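# Assign labels by position: df_tsne_clean has a fresh 0..148 index, while the
# cleaned data skips label 106, so index-aligned assignment would shift labels
df_tsne_clean['species'] = dfS_clean['species'].to_numpy()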
df_tsne_clean.head()

| | tSNE_1 | tSNE_2 | species |
|---|---|---|---|
| 0 | -27.609985 | 0.60075 | setosa |
| 1 | -23.765287 | 0.988582 | setosa |
| 2 | -24.868122 | 0.053289 | setosa |
| 3 | -24.101377 | -0.063488 | setosa |
| 4 | -28.148527 | 0.070417 | setosa |
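A note on attaching labels to an embedding after rows have been dropped: pandas aligns column assignments by index label, not by position. A newly built DataFrame gets a fresh `0..n-1` index, so labels taken from a frame with a gap in its index can end up on the wrong rows. The toy sketch below uses hypothetical data to show the difference between index-aligned and positional assignment.

```python
import pandas as pd

labels = pd.Series(['a', 'b', 'c', 'd'], name='label')  # index 0, 1, 2, 3
kept = labels.drop(1)                                   # index 0, 2, 3

# A new frame built from an array gets a fresh index 0, 1, 2
embedding = pd.DataFrame({'x': [10.0, 30.0, 40.0]})

# Index-aligned: row 1 receives 'b', the label of the dropped row
embedding['aligned'] = labels

# Positional: labels follow the order of the rows that were actually kept
embedding['positional'] = kept.to_numpy()

print(embedding)
```

In the Iris case the mismatch would be invisible in the plot, because every row after the dropped one is also virginica, but assigning labels positionally avoids relying on that coincidence.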


In [30]:
fig_tsne_clean = px.scatter(
    df_tsne_clean,
    x='tSNE_1',
    y='tSNE_2',
    color='species',
    width=700,
    height=400,
    title='tSNE of Iris Dataset without Outlier'
)

fig_tsne_clean.show()

Removing a point can change the density and spread of the embedding: if the outlier had strongly influenced t-SNE's neighbor calculations, the relative positions of the remaining points would shift noticeably. That is not the case here. The two plots are very similar; the second simply lacks the outlier.

### Really an outlier?

Determining whether a point is an outlier typically involves several methods to identify observations that appear significantly different or distant from most of the data. 
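One common rule of thumb is the 1.5 × IQR fence, applied here per feature within the virginica samples only. This is a sketch of that rule rather than a definitive test; it assumes the candidate is the row with index 106 found above, and `candidate_index`, `lower`, `upper`, and `flags` are names introduced for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

candidate_index = 106  # assumed: index found by the nearest-point search in t-SNE space
virginica = df[df['species'] == 'virginica'].drop(columns='species')

# Per-feature quartiles and 1.5 * IQR fences within the virginica samples
q1 = virginica.quantile(0.25)
q3 = virginica.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

candidate = df.loc[candidate_index, iris.feature_names].astype(float)
flags = (candidate < lower) | (candidate > upper)

summary = pd.DataFrame({
    'value': candidate,
    'lower_fence': lower,
    'upper_fence': upper,
    'flagged': flags
})
print(summary)
```

A feature is flagged when the candidate's value falls outside the fences for its own species. Being flagged on one feature is evidence, not proof, that the sample is unusual.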

We will visualize the data for the outlier.

In [31]:
print(df_melt.shape)
df_melt.head()

(600, 3)


| | species | variable | value |
|---|---|---|---|
| 0 | setosa | sepal length (cm) | 5.1 |
| 1 | setosa | sepal length (cm) | 4.9 |
| 2 | setosa | sepal length (cm) | 4.7 |
| 3 | setosa | sepal length (cm) | 4.6 |
| 4 | setosa | sepal length (cm) | 5.0 |


In [32]:
# Filter the DataFrame for only virginica species
virginica_df_melt = df_melt[df_melt['species'] == 'virginica'].copy().reset_index(drop=True)
print(virginica_df_melt.shape)
virginica_df_melt.head()

(200, 3)


| | species | variable | value |
|---|---|---|---|
| 0 | virginica | sepal length (cm) | 6.3 |
| 1 | virginica | sepal length (cm) | 5.8 |
| 2 | virginica | sepal length (cm) | 7.1 |
| 3 | virginica | sepal length (cm) | 6.3 |
| 4 | virginica | sepal length (cm) | 6.5 |


In [33]:
# Get the outlier's original (unscaled) feature values, looked up by index label
outlier_values = df.loc[outlier_index, df.columns[:-1]].astype(float)
outlier_values

sepal length (cm)    4.9
sepal width (cm)     2.5
petal length (cm)    4.5
petal width (cm)     1.7
Name: 106, dtype: float64

In [34]:
# Plot the virginica species in the original feature space
fig_virg = px.box(
    virginica_df_melt,
    x='variable',
    y='value'
)
fig_virg.update_traces(marker_color=px.colors.qualitative.Plotly[2])
fig_virg.update_layout(
    title='Virginica Species with the Outlier',
    width=700,
    height=400
)
# Iterate over the outlier's feature values and mark each one on the box plot
for feature, value in outlier_values.items():
    # Add a scatter trace with a hollow circle marker around the outlier's value
    fig_virg.add_trace(
        go.Scatter(
            x=[feature],
            y=[value],
            mode='markers',
            marker=dict(
                size=10,
                color='rgba(0,0,0,0)',  # Transparent fill
                line=dict(width=2)
            ),
            name='Outlier',
            showlegend=False)
    )
fig_virg.show()

## Conclusions

Key Takeaways
- Principal Component Analysis (PCA) captures the majority of variance within the Iris dataset in just two principal components.
- Multidimensional Scaling (MDS) preserves the pairwise distances well, showing distinct clustering for the setosa species.
- Locally Linear Embedding (LLE) highlights the local neighborhood structures but may cause some overlap between the versicolor and virginica species.
- Isometric Mapping (Isomap) effectively unfolds the global structure, offering a clear separation between the setosa and other species.
- t-Distributed Stochastic Neighbor Embedding (t-SNE) gives the clearest visual separation among the three species on this dataset, although versicolor and virginica remain close and one virginica sample falls among the versicolor points.
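The comparison above is visual. As a rough quantitative complement, the sketch below scores each 2-D embedding with `sklearn.manifold.trustworthiness`, which measures how well local neighborhoods from the original space are retained. The hyperparameters mirror those used earlier where possible; `n_neighbors=10` for the score is an arbitrary choice made here, and `normalized_stress` is left at its default for MDS.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import (
    MDS, Isomap, LocallyLinearEmbedding, TSNE, trustworthiness
)
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

models = {
    'PCA': PCA(n_components=2),
    'MDS': MDS(n_components=2, random_state=25),
    'LLE': LocallyLinearEmbedding(
        n_neighbors=50, n_components=2,
        method='modified', eigen_solver='dense'
    ),
    'Isomap': Isomap(n_neighbors=50, n_components=2),
    't-SNE': TSNE(n_components=2, random_state=20),
}

scores = {}
for name, model in models.items():
    embedding = model.fit_transform(X_std)
    # Score in [0, 1]; higher means local neighborhoods are better preserved
    scores[name] = trustworthiness(X_std, embedding, n_neighbors=10)

for name, score in scores.items():
    print(f'{name:7s} trustworthiness = {score:.3f}')
```

Trustworthiness says nothing about species labels, so a high score does not by itself imply good class separation; it only indicates that points shown close together in the plot were also close in the original four-dimensional space.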

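As for the flagged sample (index 106), its sepal length of 4.9 cm lies well below the other virginica measurements in the box plot, so it does appear to be an outlier for its species.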

## References

- https://scikit-learn.org/stable/modules/manifold.html
- https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
- Müller, A.C. & Guido, S. (2017) Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O'Reilly, chapter 3.
- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O'Reilly Media, Inc. chapter 5.