<img src="https://digmet.files.wordpress.com/2014/12/step2-nsa-netvizz.png" width="650px" height="650px"/> 

# Visualizing High Dimensional Clusters

## Contents
1. [Introduction:](#1)
1. [Imports:](#2)
1. [Read the Data:](#3)
1. [Exploration/Engineering:](#4)
1. [Clustering:](#5)
1. [**Method #1:** *Principal Component Analysis* (PCA):](#6)
1. [**Method #2:** *T-Distributed Stochastic Neighbor Embedding* (T-SNE):](#7)
1. [Conclusion:](#8)
1. [Closing Remarks:](#9)

<a id="1"></a>
# Introduction:

In this notebook we will be exploring two different methods that can be used to visualize [clusters](https://en.wikipedia.org/wiki/Cluster_analysis) that were formed on high-dimensional data (data with more than three dimensions).

First, we will clean our data so that it's in a proper format for clustering, then, we will divide the data into three different clusters using [K-Means Clustering](https://en.wikipedia.org/wiki/K-means_clustering). After that, we will go ahead and visualize our three clusters using our two methods: [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA), and [T-Distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) (T-SNE).

The data we will be using will be the [Forest Cover Type Dataset](https://www.kaggle.com/uciml/forest-cover-type-dataset).

<a id="2"></a>
# Imports:

In [89]:
#Basic imports
import numpy as np
import pandas as pd

#sklearn imports
from sklearn.decomposition import PCA #Principal Component Analysis
from sklearn.manifold import TSNE #T-Distributed Stochastic Neighbor Embedding
from sklearn.cluster import KMeans #K-Means Clustering
from sklearn.preprocessing import StandardScaler #used for 'Feature Scaling'
from sklearn.metrics import silhouette_score

#plotly imports
import plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

<a id="3"></a>
# Read the data:

In [90]:
#df is our original DataFrame
df = pd.read_csv("./INSECTS-abrupt_balanced_norm.csv")

<a id="4"></a>
# Exploration/Engineering:

This is not a particularly important section of the Kernel as the bulk of the interesting work will be done in the next few sections. Feel free to skim this part, if you want.

First, we construct a new DataFrame, `X` that we can modify. `X` will begin as a 'copy' of the original DataFrame, `df`.

In [91]:
X = df.copy()

Any missing values?

In [92]:
X.isnull().sum()

Att1     0
Att2     0
Att3     0
Att4     0
Att5     0
Att6     0
Att7     0
Att8     0
Att9     0
Att10    0
Att11    0
Att12    0
Att13    0
Att14    0
Att15    0
Att16    0
Att17    0
Att18    0
Att19    0
Att20    0
Att21    0
Att22    0
Att23    0
Att24    0
Att25    0
Att26    0
Att27    0
Att28    0
Att29    0
Att30    0
Att31    0
Att32    0
Att33    0
class    0
dtype: int64

Sweet! No missing values. That saves us quite a bit of work.

In [93]:
X.head()

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Att25,Att26,Att27,Att28,Att29,Att30,Att31,Att32,Att33,class
0,0.507066,0.153333,0.226092,0.302447,0.007239,0.36912,0.332436,0.017807,0.032819,0.033009,...,0.006855,0.017358,0.01343,0.009138,0.006768,0.007291,0.009224,0.036218,0.162955,ae-albopictus-female
1,0.281661,0.355953,0.253196,0.340335,0.415631,0.503923,0.392029,0.003648,0.068381,0.011155,...,0.005631,0.014048,0.002431,0.007076,0.037682,0.003089,0.004207,0.004144,0.005044,ae-albopictus-female
2,0.19375,0.257782,0.183339,0.247017,0.302133,0.363522,0.269729,0.293543,0.293002,0.029522,...,0.023837,0.013922,0.081406,0.413674,0.295615,0.120392,0.036566,0.032652,0.025776,cx-quinq-female
3,0.514782,0.154867,0.016903,0.226084,0.297642,0.239111,0.248268,0.066745,0.11502,0.083407,...,0.020949,0.023019,0.021147,0.020813,0.019048,0.011606,0.013379,0.044839,0.123552,ae-albopictus-female
4,0.774337,0.012549,0.105751,0.033302,0.01717,0.049754,0.1735,0.05522,0.044184,0.034923,...,0.034876,0.060708,0.048119,0.027417,0.015022,0.010218,0.008121,0.012539,0.018058,ae-aegypti-male


If we look at the columns `X["Horizontal_Distance_To_Hydrology"]` and `X["Vertical_Distance_To_Hydrology"]`, we see that we can combine them into a new column, `X["Distance_To_Hydrology"]`, which measures the shortest (straight-line) distance to Hydrology. We can calculate the values of this column with the [Pythagorean Theorem](https://en.wikipedia.org/wiki/Pythagorean_theorem).

In [None]:
X["Distance_To_Hydrology"] = ( (X["Horizontal_Distance_To_Hydrology"] ** 2) + (X["Vertical_Distance_To_Hydrology"] ** 2) ) ** (0.5)
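As a quick sanity check of the formula, here is a minimal sketch on a made-up two-row DataFrame (the values are illustrative, not taken from the dataset). It compares the explicit formula with `np.hypot`, which computes the same square root of a sum of squares.

```python
import numpy as np
import pandas as pd

# Toy values chosen for illustration only (not from the dataset)
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3.0, 5.0],
    "Vertical_Distance_To_Hydrology": [4.0, -12.0],
})

# Explicit Pythagorean formula, as in the cell above
manual = (toy["Horizontal_Distance_To_Hydrology"] ** 2
          + toy["Vertical_Distance_To_Hydrology"] ** 2) ** 0.5

# np.hypot(a, b) evaluates sqrt(a**2 + b**2) element-wise
toy["Distance_To_Hydrology"] = np.hypot(
    toy["Horizontal_Distance_To_Hydrology"],
    toy["Vertical_Distance_To_Hydrology"],
)

print(toy["Distance_To_Hydrology"].tolist())  # [5.0, 13.0]
```

Note that a negative vertical distance is not a problem, since it gets squared.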

Now that we have `X["Distance_To_Hydrology"]`, and because there's nothing extra special about Vertical or Horizontal Distances to Hydrology, we can drop the original two columns:

In [None]:
X.drop(["Horizontal_Distance_To_Hydrology","Vertical_Distance_To_Hydrology"], axis=1, inplace=True)

In [None]:
X.head()

Next, if you take a look at the values contained within `X['Cover_Type']`, you'll notice that it contains numerically-encoded [categorical data](https://en.wikipedia.org/wiki/Categorical_variable). If we head over to the column descriptions on the [Forest Cover Type Dataset](https://www.kaggle.com/uciml/forest-cover-type-dataset) page, it says that:

> *1 = "Spruce/Fir", 2 = "Lodgepole Pine", 3 = "Ponderosa Pine", 4 = "Cottonwood/WIllow", 5 = "Aspen", 6 = "Douglas-fir", and 7 = "Krummholz".*

We'll relabel our data so that the values in `X['Cover_Type']` are more descriptive of what they represent. Relabelling also lets us easily apply a [one-hot-encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding) afterwards, so that `X['Cover_Type']` is encoded along with the rest of the categorical data in `X`.

In [None]:
X['Cover_Type'].replace({1:'Spruce/Fir', 2:'Lodgepole Pine', 3:'Ponderosa Pine', 4:'Cottonwood/Willow', 5:'Aspen', 6:'Douglas-fir', 7:'Krummholz'}, inplace=True)

In [None]:
X.head()

And now we can 'one-hot-encode' this column:

In [94]:
#We use pandas's 'get_dummies()' method
X = pd.get_dummies(X)

In [95]:
X.head()

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Att30,Att31,Att32,Att33,class_ae-aegypti-female,class_ae-aegypti-male,class_ae-albopictus-female,class_ae-albopictus-male,class_cx-quinq-female,class_cx-quinq-male
0,0.507066,0.153333,0.226092,0.302447,0.007239,0.36912,0.332436,0.017807,0.032819,0.033009,...,0.007291,0.009224,0.036218,0.162955,False,False,True,False,False,False
1,0.281661,0.355953,0.253196,0.340335,0.415631,0.503923,0.392029,0.003648,0.068381,0.011155,...,0.003089,0.004207,0.004144,0.005044,False,False,True,False,False,False
2,0.19375,0.257782,0.183339,0.247017,0.302133,0.363522,0.269729,0.293543,0.293002,0.029522,...,0.120392,0.036566,0.032652,0.025776,False,False,False,False,True,False
3,0.514782,0.154867,0.016903,0.226084,0.297642,0.239111,0.248268,0.066745,0.11502,0.083407,...,0.011606,0.013379,0.044839,0.123552,False,False,True,False,False,False
4,0.774337,0.012549,0.105751,0.033302,0.01717,0.049754,0.1735,0.05522,0.044184,0.034923,...,0.010218,0.008121,0.012539,0.018058,False,True,False,False,False,False
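To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame (the rows are made up; only the column names `Att1` and `class` follow the data above). The text column is expanded into one indicator column per label, while the numeric column passes through unchanged.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "Att1": [0.5, 0.2, 0.7],
    "class": ["ae-aegypti-male", "cx-quinq-female", "ae-aegypti-male"],
})

# Text columns are replaced by one indicator column per distinct label
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
# ['Att1', 'class_ae-aegypti-male', 'class_cx-quinq-female']
```

Each row ends up with exactly one indicator set, which is what makes the encoding 'one-hot'.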


<a id="5"></a>
# Clustering:

Now, before we get into clustering our data, we just need to do one more thing: [feature-scale](https://en.wikipedia.org/wiki/Feature_scaling#Standardization) our [numerical variables](https://www.dummies.com/education/math/statistics/types-of-statistical-data-numerical-categorical-and-ordinal/).

We need to do this because, while each of our categorical variables holds values of either 0 or 1, some of our numerical variables hold values like 2596 and 2785. If we were to leave our data like this, then K-Means Clustering would not give us such a nice result, since K-Means Clustering measures the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between data-points. This means that, if we were to leave our numerical variables un-scaled, then most of the distance measured between points would be attributed to the larger numerical variables, rather than any of the categorical variables.
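To see how lopsided this can get, here is a minimal sketch with two made-up points, each with one large-scale numerical feature and one binary feature. It computes the Euclidean distance and each feature's share of the squared distance.

```python
import numpy as np

# Two made-up points: [large-scale numerical feature, binary feature]
a = np.array([2596.0, 0.0])
b = np.array([2785.0, 1.0])

diff = a - b
distance = np.linalg.norm(diff)          # Euclidean distance between a and b
share = diff ** 2 / (diff ** 2).sum()    # each feature's share of the squared distance

print(distance)
print(share)
```

Even though the two points disagree on the binary feature, nearly all of the distance comes from the numerical one.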

To fix this problem we will scale all of our numerical variables through the use of sklearn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) tool. This tool allows us to scale each numerical variable such that each numerical variable's mean becomes 0, and its variance becomes 1. This is a good way to make sure that all of the numerical variables are on roughly the same scale as the categorical (binary) variables.
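As a small illustration of what `StandardScaler` does, the sketch below scales a made-up array with two columns on very different scales and then checks the column means and variances. The numbers are for demonstration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
raw = np.array([[2596.0, 0.2],
                [2785.0, 0.9],
                [2804.0, 0.4],
                [2590.0, 0.7]])

scaled = StandardScaler().fit_transform(raw)

# After scaling, each column has mean ~0 and (population) variance ~1
print(scaled.mean(axis=0))
print(scaled.var(axis=0))
```

The means will not print as exactly zero because of floating-point rounding, but they should be vanishingly small.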

But, to make sure we scale only our numerical variables, and not our categorical variables, we'll split our current DataFrame, `X`, into two other DataFrames, `numer` and `cater`, feature-scale `numer`, then recombine the two DataFrames into a DataFrame that is suitable for clustering.

In [96]:
#numer is the DataFrame that holds all of X's numerical variables
numer = X[["Att1","Att2","Att3","Att4","Att5","Att6","Att7","Att8","Att9","Att10","Att11","Att12","Att13","Att14","Att15","Att16","Att17","Att18",
           "Att19","Att20","Att21","Att22","Att23","Att24","Att25","Att26","Att27","Att28","Att29","Att30","Att31","Att32","Att33",]]

In [None]:
#cater is the DataFrame that holds all of X's categorical variables
cater = X[["class_ae-aegypti-female","class_ae-aegypti-male","class_ae-albopictus-female","class_ae-albopictus-male","class_cx-quinq-female","class_cx-quinq-male"]]

In [None]:
numer.head()

In [None]:
cater.head()

Okay. Now that we have our separate numerical DataFrame, it's time to feature-scale it:

In [97]:
#Initialize our scaler
scaler = StandardScaler()

In [98]:
#Scale each column in numer
numer = pd.DataFrame(scaler.fit_transform(numer))

`fit_transform` returns a plain NumPy array, so the new DataFrame has lost its column names. We restore them:

In [99]:
numer.columns = ["Att1","Att2","Att3","Att4","Att5","Att6","Att7","Att8","Att9","Att10","Att11","Att12","Att13","Att14","Att15","Att16","Att17","Att18",
           "Att19","Att20","Att21","Att22","Att23","Att24","Att25","Att26","Att27","Att28","Att29","Att30","Att31","Att32","Att33",]

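Now we can build our new, scaled `X`. Note that the line that re-merges `numer` with `cater` is commented out in this run, so `X` holds only the scaled numerical columns.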

In [100]:
#X = pd.concat([numer, cater], axis=1, join='inner')
X = numer

In [101]:
X.head()

Unnamed: 0,Att1,Att2,Att3,Att4,Att5,Att6,Att7,Att8,Att9,Att10,...,Att24,Att25,Att26,Att27,Att28,Att29,Att30,Att31,Att32,Att33
0,0.559641,-0.844583,0.269394,0.490328,-1.439682,0.170525,-0.332905,-0.536618,-0.50075,-0.369955,...,-0.265002,-0.552496,-0.354897,-0.487487,-0.596957,-0.55122,-0.50946,-0.430717,0.009176,1.969155
1,-0.611995,0.591479,0.492784,0.758901,0.872287,0.853445,0.151551,-0.77081,0.029759,-0.803763,...,-0.562118,-0.573583,-0.42288,-0.684422,-0.625782,0.091232,-0.601255,-0.537905,-0.556968,-0.550889
2,-1.06895,-0.104304,-0.082973,0.097406,0.229758,0.142165,-0.842677,4.024094,3.380625,-0.439173,...,-0.023671,-0.259944,-0.425468,0.729612,5.058152,5.451573,1.9613,0.153447,-0.053768,-0.220035
3,0.599748,-0.83371,-1.454729,-0.05098,0.204334,-0.488108,-1.017142,0.272823,0.725514,0.63046,...,0.045054,-0.309696,-0.238627,-0.349315,-0.433749,-0.296018,-0.415196,-0.341945,0.161347,1.340337
4,1.948894,-1.842384,-0.722449,-1.417537,-1.383461,-1.4474,-1.624962,0.082198,-0.331208,-0.331961,...,-0.319758,-0.069773,0.535458,0.133614,-0.34143,-0.379686,-0.445518,-0.454282,-0.408787,-0.343203


**Time to build our clusters.**

In this kernel, we fit K-Means with `n_clusters=6`, which equals the number of class labels in the data, and then plot only clusters 0, 1, and 2. Restricting the plots to three clusters helps us visualize our data in a non-complicated way.

In [115]:
#Initialize our model
kmeans = KMeans(n_clusters=6)

In [116]:
#Fit our model
kmeans.fit(X)





In [117]:
#Find which cluster each data-point belongs to
clusters = kmeans.predict(X)

In [118]:
silhouette_score(X,clusters)

0.13488262369235784
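The silhouette score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. One common use for it is comparing candidate values of `n_clusters`. The sketch below does this on synthetic data from `make_blobs` (an assumption for illustration; it is not this notebook's data), with `random_state` and `n_init` fixed so the run is repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: 4 blobs in 10 dimensions
X_demo, _ = make_blobs(n_samples=600, n_features=10, centers=4,
                       cluster_std=1.0, random_state=0)

# Fit K-Means for several cluster counts and record the silhouette score
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("highest score at k =", best_k)
```

On real data the scores are often much closer together than on clean synthetic blobs, so it's worth treating the highest score as a hint rather than a verdict.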

In [None]:
#Add the cluster vector to our DataFrame, X
X["Cluster"] = clusters

Now that we have our clusters, we can begin visualizing our data!

<a id="6"></a>
# **Method #1:** *Principal Component Analysis* (PCA):

Our first method for visualization will be [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA). 

PCA is an algorithm that is used for [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) - meaning, informally, that it can take in a DataFrame with many columns and return a DataFrame with a *reduced* number of columns that still retains much of the information from the columns of the original DataFrame. The columns of the DataFrame produced from the PCA procedure are called *Principal Components*. We will use these principal components to help us visualize our clusters in 1-D, 2-D, and 3-D space, since we cannot easily visualize the data we have in higher dimensions. For example, we can use two principal components to visualize the clusters in 2-D space, or three principal components to visualize the clusters in 3-D space.
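To get a feel for how much information a handful of principal components can retain, here is a minimal sketch on made-up data (10 columns generated mostly from 2 hidden factors, which is an assumption chosen to make the effect visible). The attribute `explained_variance_ratio_` reports the share of the total variance captured by each component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 10 columns driven mostly by 2 hidden factors plus noise
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = factors @ mixing + 0.1 * rng.normal(size=(200, 10))

pca = PCA(n_components=3)
components = pca.fit_transform(data)

print(components.shape)                     # (200, 3)
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total share kept by the 3 components
```

When the first few ratios are small, as can happen with real data, a 2-D or 3-D PCA plot is showing only a thin slice of the structure.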

But first, we will create a separate, smaller DataFrame, `plotX`, to plot our data with. The reason we create a smaller DataFrame is so that we can plot our data faster, and so that our plots do not turn out looking too messy or over-crowded.

In [None]:
#plotX is a DataFrame containing 5000 values sampled randomly from X
plotX = pd.DataFrame(np.array(X.sample(5000)))

#Rename plotX's columns since it was briefly converted to an np.array above
plotX.columns = X.columns

(The reason we converted `X.sample(5000)` to a numpy array, then back to a pandas DataFrame, is so that the indices of the resulting DataFrame, `plotX`, are *'renumbered'* 0-4999.)
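The renumbering matters because `pd.concat(..., axis=1)` aligns rows by index. An alternative sketch that stays within pandas, and keeps the column names and dtypes intact, is `reset_index(drop=True)`. The frame below is a made-up stand-in for `X`, and `random_state` is added only to make the example repeatable.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for X
demo = pd.DataFrame({"Att1": np.arange(10.0), "Cluster": np.arange(10) % 3})

# Sampling keeps the original row labels...
sampled = demo.sample(5, random_state=0)

# ...so we renumber them 0..n-1 and discard the old labels
renumbered = sampled.reset_index(drop=True)

print(renumbered.index.tolist())    # [0, 1, 2, 3, 4]
print(renumbered.columns.tolist())  # ['Att1', 'Cluster']
```

With this approach there is no need to reassign the column names afterwards.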

Now, to visualize our data, we will build three DataFrames from `plotX` using the 'PCA' algorithm. 

The *first* DataFrame will hold the results of the PCA algorithm with only one principal component. This DataFrame will be used to visualize our clusters in *one dimension* ([**1-D**](#PCA_1D)).

The *second* DataFrame will hold the two principal components returned by the PCA algorithm with `n_components=2`. This DataFrame will aid us in our visualization of these clusters in *two dimensions* ([**2-D**](#PCA_2D)).

And the *third* DataFrame will hold the results of the PCA algorithm that returns three principal components. This DataFrame will allow us to visualize the clusters in *three dimensional space* ([**3-D**](#PCA_3D)).

We initialize our PCA models:

In [None]:
#PCA with one principal component
pca_1d = PCA(n_components=1)

#PCA with two principal components
pca_2d = PCA(n_components=2)

#PCA with three principal components
pca_3d = PCA(n_components=3)

We build our new DataFrames:

In [None]:
#This DataFrame holds that single principal component mentioned above
PCs_1d = pd.DataFrame(pca_1d.fit_transform(plotX.drop(["Cluster"], axis=1)))

#This DataFrame contains the two principal components that will be used
#for the 2-D visualization mentioned above
PCs_2d = pd.DataFrame(pca_2d.fit_transform(plotX.drop(["Cluster"], axis=1)))

#And this DataFrame contains three principal components that will aid us
#in visualizing our clusters in 3-D
PCs_3d = pd.DataFrame(pca_3d.fit_transform(plotX.drop(["Cluster"], axis=1)))

(Note that, above, we performed our PCA's on data that *excluded* the `Cluster` variable.)

Rename the columns of these newly created DataFrames:

In [None]:
PCs_1d.columns = ["PC1_1d"]

#"PC1_2d" means: 'The first principal component of the components created for 2-D visualization, by PCA.'
#And "PC2_2d" means: 'The second principal component of the components created for 2-D visualization, by PCA.'
PCs_2d.columns = ["PC1_2d", "PC2_2d"]

PCs_3d.columns = ["PC1_3d", "PC2_3d", "PC3_3d"]

We concatenate these newly created DataFrames to `plotX` so that they can be used by `plotX` as columns.

In [None]:
plotX = pd.concat([plotX,PCs_1d,PCs_2d,PCs_3d], axis=1, join='inner')

And we create one new column for `plotX` so that we can use it for 1-D visualization.

In [None]:
plotX["dummy"] = 0

Now we divide our DataFrame, `plotX`, into three new DataFrames. 

Each of these new DataFrames will hold all of the values contained in exactly one of the clusters. For example, all of the values contained within the DataFrame `cluster0` will belong to 'cluster 0', and all the values contained in the DataFrame `cluster1` will belong to 'cluster 1', etc.

In [None]:
#Note that all of the DataFrames below are sub-DataFrames of 'plotX'.
#This is because we intend to plot the values contained within each of these DataFrames.

cluster0 = plotX[plotX["Cluster"] == 0]
cluster1 = plotX[plotX["Cluster"] == 1]
cluster2 = plotX[plotX["Cluster"] == 2]



## PCA Visualizations:

In [None]:
#This is needed so we can display plotly plots properly
init_notebook_mode(connected=True)

<a id="PCA_1D"></a>
### 1-D Visualization:

The plot below displays our three original clusters on the single *principal component* created for 1-D visualization:

In [None]:
#Instructions for building the 1-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["PC1_1d"],
                    y = cluster0["dummy"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["PC1_1d"],
                    y = cluster1["dummy"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["PC1_1d"],
                    y = cluster2["dummy"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3]

title = "Visualizing Clusters in One Dimension Using PCA"

layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= '',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)
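Because the per-cluster traces differ only in the cluster label and colour, they can also be generated in a loop, which scales to any number of clusters. The sketch below is one way to do it with plain dicts (plotly accepts a figure described as nested dicts, as in the `fig = dict(...)` line above). The helper name `build_traces`, the toy frame, and the colour list are assumptions for illustration.

```python
import pandas as pd

def build_traces(frame, x_col, y_col, colors):
    """Return one scatter-trace dict per cluster label found in frame["Cluster"]."""
    traces = []
    for i, label in enumerate(sorted(frame["Cluster"].unique())):
        subset = frame[frame["Cluster"] == label]
        traces.append(dict(
            type="scatter",
            x=subset[x_col].tolist(),
            y=subset[y_col].tolist(),
            mode="markers",
            name=f"Cluster {label}",
            marker=dict(color=colors[i % len(colors)]),  # cycle colours if needed
        ))
    return traces

# Made-up stand-in for plotX
toy = pd.DataFrame({
    "PC1_1d": [-1.2, 0.3, 2.1, -0.8, 1.9, 0.1],
    "dummy": [0, 0, 0, 0, 0, 0],
    "Cluster": [0, 1, 2, 0, 2, 1],
})
colors = ["rgba(255, 128, 255, 0.8)", "rgba(255, 128, 2, 0.8)", "rgba(0, 255, 200, 0.8)"]

data = build_traces(toy, "PC1_1d", "dummy", colors)
fig = dict(data=data, layout=dict(title="Clusters in One Dimension (toy data)"))

print([t["name"] for t in data])  # ['Cluster 0', 'Cluster 1', 'Cluster 2']
# iplot(fig)  # uncomment inside a notebook where plotly has been initialised
```

The same helper would work for the 2-D plots by passing different column names.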

<a id="PCA_2D"></a>
### 2-D visualization:

The next plot displays the three clusters on the two *principal components* created for 2-D visualization:

In [None]:
#Instructions for building the 2-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["PC1_2d"],
                    y = cluster0["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["PC1_2d"],
                    y = cluster1["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["PC1_2d"],
                    y = cluster2["PC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)




data = [trace1, trace2, trace3]

title = "Visualizing Clusters in Two Dimensions Using PCA"

layout = dict(title = title,
              xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

<a id="PCA_3D"></a>
### 3-D Visualization:

This last plot below displays our clusters on the three *principal components* created for 3-D visualization:

In [None]:
#Instructions for building the 3-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter3d(
                    x = cluster0["PC1_3d"],
                    y = cluster0["PC2_3d"],
                    z = cluster0["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter3d(
                    x = cluster1["PC1_3d"],
                    y = cluster1["PC2_3d"],
                    z = cluster1["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter3d(
                    x = cluster2["PC1_3d"],
                    y = cluster2["PC2_3d"],
                    z = cluster2["PC3_3d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)



data = [trace1, trace2, trace3]

title = "Visualizing Clusters in Three Dimensions Using PCA"

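#For 3-D plots, the axes are configured under 'scene'
layout = dict(title = title,
              scene = dict(xaxis= dict(title= 'PC1',ticklen= 5,zeroline= False),
                           yaxis= dict(title= 'PC2',ticklen= 5,zeroline= False),
                           zaxis= dict(title= 'PC3',ticklen= 5,zeroline= False))
             )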

fig = dict(data = data, layout = layout)

iplot(fig)

## PCA Remarks:

As we can see from the plots above: if you have data that is highly *clusterable*, then PCA is a pretty good way to view the clusters formed on the original data. Also, it would seem that visualizing the clusters is more effective when the clusters are visualized using more principal components, rather than fewer. For example, the 2-D plot did a better job of providing a clear visual representation of the clusters than the 1-D plot; and the 3-D plot did a better job than the 2-D plot!

<a id="7"></a>
# **Method #2:** *T-Distributed Stochastic Neighbor Embedding* (T-SNE):

Our next method for visualizing our clusters is [T-Distributed Stochastic Neighbor Embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) (T-SNE).

Here is a good [video](https://www.youtube.com/watch?v=wvsE8jm1GzE) by Google that gives a quick overview of what the algorithm does. And here is a [video](https://www.youtube.com/watch?v=NEaUSP4YerM) that gives a helpful and simplified explanation of how the algorithm does what it does, if you're interested.

In short, T-SNE is an interesting and complicated machine learning algorithm that can help us visualize high-dimensional data. It is a method for performing dimensionality reduction, and it is for this reason that we can use it to help us visualize our three clusters that were built on high-dimensional data.

Note: And just like before, we will use this algorithm to visualize our data in [**1-D**](#T-SNE_1D), [**2-D**](#T-SNE_2D), and [**3-D**](#T-SNE_3D) space!

Once again, we create a sub-DataFrame called `plotX` that will hold a sample of the data from `X` for the purpose of visualization.

In [None]:
#plotX will hold the values we wish to plot
plotX = pd.DataFrame(np.array(X.sample(5000)))
plotX.columns = X.columns

Next up, we have to decide what level of `perplexity` we would like to use for our T-SNE algorithm. The `perplexity` is a hyperparameter used in the T-SNE algorithm that greatly determines how the data returned from the algorithm is distributed.

To see the role that `perplexity` plays in shaping the distribution of the data through T-SNE, check out this clearly written and interactive [article](https://distill.pub/2016/misread-tsne/) by some of the Engineers/Scientists at [Google Brain](https://ai.google/research/teams/brain).

I have found, through a few trials, that `perplexity = 50` works fairly well for this data, but am convinced that there probably exists a more ideal value for `perplexity` between the values of `30` and `50`. If you're up for the challenge, feel free to fork this Kernel and try to find the value for `perplexity` that best displays the clusters formed on the original data.
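If you'd like to try several values, one way is to loop over candidate perplexities and keep each embedding for side-by-side plotting. The sketch below does this on a small synthetic dataset from `make_blobs` (an assumption for illustration; it is not this notebook's data). The candidate values and `random_state=0` are also arbitrary choices. Note that `perplexity` must be smaller than the number of samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Small synthetic stand-in data so the loop runs quickly
X_demo, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# One 2-D embedding per candidate perplexity
embeddings = {}
for p in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=p, random_state=0)
    embeddings[p] = tsne.fit_transform(X_demo)

for p, emb in embeddings.items():
    print(f"perplexity={p}: embedding shape={emb.shape}")
```

Each embedding can then be plotted in the same way as the plots in this kernel and compared by eye.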

In [None]:
#Set our perplexity
perplexity = 50

We initialize our T-SNE models:

In [None]:
#T-SNE with one dimension
tsne_1d = TSNE(n_components=1, perplexity=perplexity)

#T-SNE with two dimensions
tsne_2d = TSNE(n_components=2, perplexity=perplexity)

#T-SNE with three dimensions
tsne_3d = TSNE(n_components=3, perplexity=perplexity)

We build our new DataFrames to help us visualize our data in 1-D, 2-D, and 3-D space:

In [None]:
#This DataFrame holds a single dimension, built by T-SNE
TCs_1d = pd.DataFrame(tsne_1d.fit_transform(plotX.drop(["Cluster"], axis=1)))

#This DataFrame contains two dimensions, built by T-SNE
TCs_2d = pd.DataFrame(tsne_2d.fit_transform(plotX.drop(["Cluster"], axis=1)))

#And this DataFrame contains three dimensions, built by T-SNE
TCs_3d = pd.DataFrame(tsne_3d.fit_transform(plotX.drop(["Cluster"], axis=1)))

(Note that, above, we performed our T-SNE algorithms on data that *excluded* the `Cluster` variable.)

Rename the columns of these newly created DataFrames:

In [None]:
TCs_1d.columns = ["TC1_1d"]

#"TC1_2d" means: 'The first component of the components created for 2-D visualization, by T-SNE.'
#And "TC2_2d" means: 'The second component of the components created for 2-D visualization, by T-SNE.'
TCs_2d.columns = ["TC1_2d","TC2_2d"]

TCs_3d.columns = ["TC1_3d","TC2_3d","TC3_3d"]

We concatenate these newly created DataFrames to `plotX` so that they can be used by `plotX` as columns.

In [None]:
plotX = pd.concat([plotX,TCs_1d,TCs_2d,TCs_3d], axis=1, join='inner')

And we create one new column for `plotX` so that we can use it for 1-D visualization.

In [None]:
plotX["dummy"] = 0

Now we divide our DataFrame, `plotX`, into three new DataFrames.

Each of these new DataFrames will hold all of the values contained in exactly one of the clusters. For example, all of the values contained within the DataFrame `cluster0` will belong to 'cluster 0', and all the values contained in the DataFrame `cluster1` will belong to 'cluster 1', etc.

In [None]:
cluster0 = plotX[plotX["Cluster"] == 0]
cluster1 = plotX[plotX["Cluster"] == 1]
cluster2 = plotX[plotX["Cluster"] == 2]

## T-SNE Visualizations:

<a id="T-SNE_1D"></a>
### 1-D Visualization:

The plot below displays our three original clusters on the single dimension created by T-SNE for 1-D visualization:

In [None]:
#Instructions for building the 1-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["TC1_1d"],
                    y = cluster0["dummy"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["TC1_1d"],
                    y = cluster1["dummy"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["TC1_1d"],
                    y = cluster2["dummy"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3]

title = "Visualizing Clusters in One Dimension Using T-SNE (perplexity=" + str(perplexity) + ")"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= '',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

<a id="T-SNE_2D"></a>
### 2-D Visualization:

The next plot displays the three clusters on the two dimensions created by T-SNE for 2-D visualization:

In [None]:
#Instructions for building the 2-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter(
                    x = cluster0["TC1_2d"],
                    y = cluster0["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter(
                    x = cluster1["TC1_2d"],
                    y = cluster1["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter(
                    x = cluster2["TC1_2d"],
                    y = cluster2["TC2_2d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3]

title = "Visualizing Clusters in Two Dimensions Using T-SNE (perplexity=" + str(perplexity) + ")"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

<a id="T-SNE_3D"></a>
### 3-D Visualization:

This last plot below displays our clusters on the three dimensions created by T-SNE for 3-D visualization:

In [None]:
#Instructions for building the 3-D plot

#trace1 is for 'Cluster 0'
trace1 = go.Scatter3d(
                    x = cluster0["TC1_3d"],
                    y = cluster0["TC2_3d"],
                    z = cluster0["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 0",
                    marker = dict(color = 'rgba(255, 128, 255, 0.8)'),
                    text = None)

#trace2 is for 'Cluster 1'
trace2 = go.Scatter3d(
                    x = cluster1["TC1_3d"],
                    y = cluster1["TC2_3d"],
                    z = cluster1["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 1",
                    marker = dict(color = 'rgba(255, 128, 2, 0.8)'),
                    text = None)

#trace3 is for 'Cluster 2'
trace3 = go.Scatter3d(
                    x = cluster2["TC1_3d"],
                    y = cluster2["TC2_3d"],
                    z = cluster2["TC3_3d"],
                    mode = "markers",
                    name = "Cluster 2",
                    marker = dict(color = 'rgba(0, 255, 200, 0.8)'),
                    text = None)

data = [trace1, trace2, trace3]

title = "Visualizing Clusters in Three Dimensions Using T-SNE (perplexity=" + str(perplexity) + ")"

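#For 3-D plots, the axes are configured under 'scene'
layout = dict(title = title,
              scene = dict(xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
                           yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False),
                           zaxis= dict(title= 'TC3',ticklen= 5,zeroline= False))
             )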

fig = dict(data = data, layout = layout)

iplot(fig)

## T-SNE Remarks:


The T-SNE algorithm did a fairly decent job in visualizing the clusters, too. But there were a few noticeable differences when comparing its resulting plots to PCA's resulting plots.

One major difference between the plots produced by PCA and T-SNE is that T-SNE's plots seemed to have their clusters overlapping with each other more so than in PCA's plots. For example, if you look at the [**2-D plot**](#PCA_2D) formed from PCA, you see three distinct sections of the data-points with strict, visible borders separating each colour into groups. Whereas, if you look at the [**2-D**](#T-SNE_2D) plot formed from T-SNE, you again see three sections formed within the data-points, but this time, data-points between each cluster seem to 'intermingle' and overlap more.

The other major difference between the plots created by PCA and the plots created by T-SNE is the shape. Because PCA and T-SNE perform dimensionality reduction in very different ways (and with different objectives), the resulting shape or distribution of the points produced by the algorithms will almost always be very different.

Bear in mind that the plots resulting from the T-SNE algorithm are quite variable, in that they depend very heavily on the value chosen for `perplexity`.

<a id="8"></a>
# Conclusion:

So there you have it: two interesting methods to view clusters formed on high-dimensional data.
One method was the standard and reliable PCA algorithm, and the other method was the somewhat more interesting and exotic T-SNE algorithm.

Both algorithms definitely have their own strengths and weaknesses when it comes to performing this task, and I'd imagine that the effectiveness of each algorithm depends largely on the type of data being given. So, in the end, it's largely up to the user which algorithm he or she prefers to use when visualizing clusterings on high-dimensional data.

<a id="9"></a>
# Closing Remarks:

I learned quite a lot in the making of this kernel -- about clusterability, perplexity, how to use plotly, the importance of feature-engineering, and much more. In all honesty, this was a ton of fun to make and has only further deepened my interest in [unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) and data visualization. I hope to make more kernels like this in the future and to continue to sharpen my skills in this area.

If you've got any feedback for me: please leave a comment below, as I'd love to hear what you've got to say. And if you found this kernel to be interesting or useful to you, please consider giving it an upvote - I'd appreciate it very much :)

Till next time!
*-Josh*