<a href="https://colab.research.google.com/github/3lueLightning/multidimensional_viz/blob/main/multidimensional_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multidimensional analysis using TSNE

In this notebook we apply multidimensional data visualization techniques to perform exploratory data analysis (EDA). To understand how TSNE works and how it can help us, we apply it to a modified version of the famous Iris dataset, which contains a series of measurements for 3 species of Iris flowers. Our goal is to predict which species each flower belongs to based on these measurements.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Load the data
The original data has 4 features ('sepal_length', 'sepal_width', 'petal_length', 'petal_width') and the target variable 'species'. To these we add 3 random variables, which are generated independently of the species.

In [19]:
iris = px.data.iris()
iris["random_1"] = np.round(np.abs(np.random.normal(4, 2, len(iris))), 1)
iris["random_2"] = np.round(np.random.uniform(0, 100, len(iris)), 1)
iris["random_3"] = np.round(np.random.uniform(-5, 5, len(iris)), 1)
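
The three random columns above are redrawn on every run, so the plots that follow will differ between executions. Below is a minimal sketch of one way to make them repeatable, assuming a seeded NumPy `Generator` is acceptable. The `add_random_columns` helper, the seed value, and the small stand-in DataFrame are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def add_random_columns(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with three noise columns drawn from a seeded generator.

    Hypothetical helper: it mirrors the distributions used above,
    but draws them from a generator with a fixed seed.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    n = len(out)
    out["random_1"] = np.round(np.abs(rng.normal(4, 2, n)), 1)
    out["random_2"] = np.round(rng.uniform(0, 100, n), 1)
    out["random_3"] = np.round(rng.uniform(-5, 5, n), 1)
    return out


# Stand-in data (assumption): any DataFrame works, including px.data.iris()
demo = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3, 5.8]})
first = add_random_columns(demo, seed=42)
second = add_random_columns(demo, seed=42)

# Same seed, same input: the two results are identical
print(first.equals(second))
```

In the notebook itself, the equivalent call would be `iris = add_random_columns(px.data.iris(), seed=42)`.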

# First look at the data

In [27]:
fig = px.scatter_matrix(
    iris,
    dimensions=[
        "sepal_width",
        "sepal_length",
        "petal_width",
        "petal_length",
        "random_1",
        "random_2",
        "random_3"],
    color="species", 
    symbol="species",
    title="Scatter matrix of iris data set",
    labels={col:col.replace('_', ' ') for col in iris.columns},
    width=1300,
    height=1300
)
fig.show()

# Analyse 2 relevant variables

In [57]:
from plotly.subplots import make_subplots


colors = px.colors.qualitative.Plotly

X = iris[['sepal_length', 'sepal_width']]
X_tsne = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X)
iris_tsne = pd.DataFrame(X_tsne, columns=['x_1', 'x_2'])
iris_tsne['species'] = iris['species']

subplot_titles = ["Original 2D measurements", "2D -> 2D mapping"]
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=subplot_titles,
)
iris_species = iris["species"].unique()

for i, species in enumerate(iris_species):
    # Both frames share the same row index, so one mask selects from either
    is_species = iris['species'] == species

    # Left panel: the original sepal measurements
    fig.add_trace(
        go.Scatter(
            x=iris.loc[is_species, 'sepal_length'],
            y=iris.loc[is_species, 'sepal_width'],
            mode="markers",
            marker=dict(color=colors[i]),
            showlegend=False,
        ),
        row=1, col=1
    )

    # Right panel: the TSNE embedding of the same points
    fig.add_trace(
        go.Scatter(
            x=iris_tsne.loc[is_species, 'x_1'],
            y=iris_tsne.loc[is_species, 'x_2'],
            mode="markers",
            marker=dict(color=colors[i]),
            name=species,
        ),
        row=1, col=2
    )

# edit axis labels
fig['layout']['xaxis']['title']='sepal_length'
fig['layout']['xaxis2']['title']='x_1'
fig['layout']['yaxis']['title']='sepal_width'
fig['layout']['yaxis2']['title']='x_2'

fig.update_layout(height=800, width=1300, title_text="Impact of TSNE")
fig.show()


The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence.



In [60]:
N_CLUSTERS = 2
kmeans = KMeans(n_clusters=N_CLUSTERS, random_state=69).fit(X_tsne)
iris_tsne['cluster'] = kmeans.labels_

fig = px.scatter(
    iris_tsne,
    x='x_1',
    y='x_2',
    color='species',
    symbol='cluster',
    height=800,
    width=1300,
)
fig.show()

# Analyse random variables

In [18]:
iris2_r = iris[['random_1', 'random_2', 'random_3', 'species']]

X2_r = iris2_r[['random_1', 'random_2', 'random_3']]
X2_r_tsne = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X2_r)
iris2_r_tsne = pd.DataFrame(X2_r_tsne, columns=['x_1', 'x_2'])
iris2_r_tsne['species'] = iris2_r['species']

fig2_r_tsne = px.scatter(
    iris2_r_tsne,
    x='x_1',
    y='x_2',
    color='species'
)
fig2_r_tsne.show()


In [None]:
# Embed three of the original measurements into 2D
X3 = iris[['sepal_length', 'sepal_width', 'petal_length']]
X3_embedded = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X3)
iris3_embedded = pd.DataFrame(X3_embedded, columns=['x_1', 'x_2'])
iris3_embedded['species'] = iris['species']

fig = px.scatter(
    iris3_embedded,
    x='x_1',
    y='x_2',
    color='species'
)
fig.show()

In [None]:
# 3D view of the three features selected in X3
fig = px.scatter_3d(
    iris,
    x='sepal_length',
    y='sepal_width',
    z='petal_length',
    color='species'
)
fig.show()