<a href="https://colab.research.google.com/github/3lueLightning/multidimensional_viz/blob/main/multidimensional_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recommended reading: https://distill.pub/2016/misread-tsne/

# Multidimensional analysis using TSNE

In this notebook we apply multidimensional data visualization techniques to perform exploratory data analysis (EDA). To understand how TSNE works and how it can help us, we apply it to a modified version of the famous Iris dataset, which contains a series of measurements of 3 species of Iris flowers. Our goal is to predict which species each flower belongs to based on those measurements.

In [16]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Load the data
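The original data has 4 features: 'sepal_length', 'sepal_width', 'petal_length', 'petal_width' and the target variable 'species'. To that we add a random variable, 'random_1', drawn uniformly from [0, 5] and rounded to one decimal. A second random variable, 'random_2', is left commented out.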

In [17]:
iris = px.data.iris()
# Names of the 4 measurement columns, captured before any random column is added
ORIGINAL_FEATURES = iris.columns[~iris.columns.isin(['species', 'species_id'])]
np.random.seed(69)
iris["random_1"] = np.round(np.random.uniform(0, 5, len(iris)), 1)
#iris["random_2"] = np.round(np.random.uniform(-5, 5, len(iris)), 1)

# First look at the data

In [18]:
fig = px.scatter_matrix(
    iris,
    dimensions=iris.columns[~iris.columns.isin(['species',	'species_id'])],
    color="species", 
    symbol="species",
    title="Scatter matrix of iris data set",
    labels={col:col.replace('_', ' ') for col in iris.columns},
    width=1300,
    height=1000
)
fig.show()

# Intuition behind TSNE
TSNE allows us to map $N$ dimensions to $M$, where $M \leq N$. It is therefore possible to map 2 dimensions to 2 dimensions, which lets us see what kind of transformation takes place and build an intuition for what happens at higher dimensionality. TSNE tries to preserve the clusters present in the higher-dimensional space when it lays the points out in the lower-dimensional space.

In [19]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go


COLORS = px.colors.qualitative.Plotly

X = iris[['sepal_length', 'sepal_width']]
X_tsne = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X)
iris_tsne = pd.DataFrame(X_tsne, columns=['x_1', 'x_2'])
iris_tsne['species'] = iris['species']

subplot_titles = ["Original measurements", "TSNE"]
fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=subplot_titles,
)
iris_species = iris["species"].unique()

for i, species in enumerate(iris_species):
  fig.append_trace(
      go.Scatter(
          x=iris.loc[iris['species'] == species, 'sepal_length'],
          y=iris.loc[iris['species'] == species, 'sepal_width'],
          mode="markers",
          marker=dict(color=COLORS[i]),
          showlegend=False,
      ),
      row=1, col=1
  )

  fig.append_trace(
      go.Scatter(
            x=iris_tsne.loc[iris['species'] == species, 'x_1'],
            y=iris_tsne.loc[iris['species'] == species, 'x_2'],
            mode="markers",
            marker=dict(color=COLORS[i]),
            name=species,
      ),
      row=1, col=2
  )

# edit axis labels
fig['layout']['xaxis']['title']='sepal_length'
fig['layout']['xaxis2']['title']='x_1'
fig['layout']['yaxis']['title']='sepal_width'
fig['layout']['yaxis2']['title']='x_2'

fig.update_layout(height=600, width=1300, title_text="Impact of TSNE, 2D -> 2D mapping")
fig.show()


The PCA initialization in TSNE will change to have the standard deviation of PC1 equal to 1e-4 in 1.2. This will ensure better convergence.



# Understanding TSNE representation

In [20]:
X = iris.loc[:, ~iris.columns.isin(['species',	'species_id'])]
X_tsne = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X)
iris_tsne = pd.concat(
    [
        pd.DataFrame(X_tsne, columns=['x_1', 'x_2']),
        X,
        iris[["species"]],
    ],
    axis="columns",
)


fig = px.scatter(
    iris_tsne,
    x='x_1',
    y='x_2',
    color='species',
    hover_data={
        col: ':.2f' for col in iris_tsne.columns if col != 'species'
    },
    height=600,
    width=800,
)
fig.show()





## Creating clusters
At the risk of overfitting, there seem to be 3 clusters present, so we set $K=3$ in our K-means algorithm.

In [21]:
from plotly.validators.scatter.marker import SymbolValidator
from typing import List

def get_marker_symbols() -> List[str]:
  # get all symbols and metadata
  symbols_meta = SymbolValidator().values
  # extract symbol names with similar symbols next to each other, ex:
  # 'circle', 'circle-open', 'circle-dot', 'circle-open-dot'
  symbols = [symbols_meta[3 * (i + 1) - 1] for i in range(int(len(symbols_meta) / 3 - 1))]
  # sort symbol names in such a way that similar symbols stay far apart
  symbols = sorted(symbols, key=lambda x: x.split("-")[1] if "-" in x else "a")
  return symbols

In [22]:
"""
    init=np.array([
        [-33, 1.4],
        [9.6, 3.9],
        [12, -3.9]
    ])
    """


In [23]:
N_CLUSTERS = 3
colors = COLORS[3:]

kmeans = KMeans(
    n_clusters=N_CLUSTERS,
    random_state=69,
).fit(X_tsne)
iris_tsne = pd.concat(
    [
        pd.DataFrame(X_tsne, columns=['x_1', 'x_2']),
        X,
        pd.Series(kmeans.labels_, name="cluster"),
        iris[["species"]],
     ]
    , axis="columns"
)

main_features = iris.columns[
    ~iris.columns.isin(['species', "species_id"])
]
subplot_titles = ["TSNE 2D iris"] + list(main_features)
top_fig_specs = [
    [{"type": "scatter", "colspan": 2}, None],
]
feature_fig_specs = [
    [{"type": "scatter", "colspan": 2}, None] for _ in main_features
]
fig_specs = top_fig_specs + feature_fig_specs

fig = make_subplots(
    rows= len(main_features) + 1,
    cols=2,
    subplot_titles=subplot_titles,
    specs=fig_specs,
    vertical_spacing = 0.03
)

symbols = get_marker_symbols()
for i, species in enumerate(iris_species):
  for cluster in range(N_CLUSTERS):
    mask = (iris_tsne['species'] == species) & (iris_tsne['cluster'] == cluster)
    fig.append_trace(
        go.Scatter(
              x=iris_tsne.loc[mask, 'x_1'],
              y=iris_tsne.loc[mask, 'x_2'],
              mode="markers",
              marker={
                "color": colors[cluster],
                "symbol": symbols[i]
              },
              name=species,
        ),
        row=1,
        col=1
    )

  
for j, feature in enumerate(main_features):
  for cluster in range(N_CLUSTERS):
    fig.append_trace(
      go.Histogram(
          x=iris_tsne.loc[iris_tsne['cluster'] == cluster, feature],
          histnorm='probability',
          marker={
                "color": colors[cluster],
          },
          showlegend = j == 0,
          name=f"cluster {cluster}"
      ),
      row=j + len(top_fig_specs) + 1,
      col=1,
    )
# Overlay the per-cluster histograms of each feature
fig.update_layout(height=400 * len(fig_specs), width=1300, barmode='overlay')
# Reduce opacity so that overlapping traces remain visible
fig.update_traces(opacity=0.75)

  
fig.show()


In [42]:
iris_tsne

| | x_1 | x_2 | sepal_length | sepal_width | petal_length | petal_width | random_1 | cluster | species |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -34.769459 | 1.978231 | 5.1 | 3.5 | 1.4 | 0.2 | 1.5 | 1 | setosa |
| 1 | -25.801302 | 0.674600 | 4.9 | 3.0 | 1.4 | 0.2 | 4.0 | 1 | setosa |
| 2 | -35.012264 | 3.380640 | 4.7 | 3.2 | 1.3 | 0.2 | 1.8 | 1 | setosa |
| 3 | -25.919237 | 1.184785 | 4.6 | 3.1 | 1.5 | 0.2 | 3.9 | 1 | setosa |
| 4 | -29.418854 | 0.850530 | 5.0 | 3.6 | 1.4 | 0.2 | 2.8 | 1 | setosa |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | 5.776468 | -4.825747 | 6.7 | 3.0 | 5.2 | 2.3 | 4.3 | 2 | virginica |
| 146 | 12.653972 | -0.291859 | 6.3 | 2.5 | 5.0 | 1.9 | 2.3 | 0 | virginica |
| 147 | 6.234777 | -4.266621 | 6.5 | 3.0 | 5.2 | 2.0 | 4.0 | 2 | virginica |
| 148 | 10.742510 | -4.928128 | 6.2 | 3.4 | 5.4 | 2.3 | 3.0 | 2 | virginica |


In [51]:
#from scipy.stats import spearmanr

def corr_plot(df1: pd.DataFrame, df2: pd.DataFrame, targets: pd.Series, clusters: pd.Series):
  fig = make_subplots(
    rows=df1.shape[1],
    cols=df2.shape[1],
    vertical_spacing = 0.03,
    horizontal_spacing = 0.03,
  )
  for i, col1 in enumerate(df1.columns, start=1):
    for j, col2 in enumerate(df2.columns, start=1): 
      for c, cluster_name in enumerate(clusters.unique()):
        for t, label in enumerate(targets.unique()):
          mask = (clusters == cluster_name) & (targets == label)
          fig.append_trace(
              go.Scatter(
                  x=df1.loc[mask, col1],
                  y=df2.loc[mask, col2],
                  mode="markers",
                  marker={
                    "color": colors[c],
                    "symbol": symbols[t]
                  },
              ),
              row=i,
              col=j
          )
  fig.update_layout(height=300 * df1.shape[1], width=1300)
  fig.show()


corr_plot(
    iris_tsne[["sepal_length",	"sepal_width", "random_1"]],
    iris_tsne[["x_1", "x_2"]],
    iris_tsne.species,
    iris_tsne.cluster
)


In [None]:
X3 = iris[['sepal_length', 'sepal_width', 'petal_length']]
X3_embedded = TSNE(
    n_components=2,
    learning_rate='auto',
    init='pca',
    perplexity=20
).fit_transform(X3)  # embed the 3-feature subset defined above, not the full X
iris3_embedded = pd.DataFrame(X3_embedded, columns=['x_1', 'x_2'])
iris3_embedded['species'] = iris['species']

fig = px.scatter(
    iris3_embedded,
    x='x_1',
    y='x_2',
    color='species'
)
fig.show()

In [None]:
fig = px.scatter_3d(
    iris,
    x='sepal_length',
    y='sepal_width',
    z='petal_length',  # same third feature as X3 in the previous cell
    color='species'
)
fig.show()