# PCA and K-means Clustering
<br> Tutorial from: https://towardsdatascience.com/k-means-and-pca-for-image-clustering-a-visual-analysis-8e10d4abba40
<br>The plan is to run k-means on the dataset, but only after applying PCA to it. The tutorial follows these steps:

* Load the dataset from Keras
* Pre-process the data by flattening it (from a 60000 x 28 x 28 array to a 60000 x 784 array)
* Apply PCA to reduce the dimensions (from 784 to 420 components, keeping 0.98 of the variance)
* Apply k-means clustering to the PCA-transformed dataset (10 clusters)
* Observe and analyze the results using matplotlib and plotly
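
As a minimal sketch of the flattening step, assuming a small random stand-in array instead of the real Keras data (the name `images` and the scaling to [0, 1] are illustrative choices, not taken from the tutorial):

```python
import numpy as np

# Hypothetical stand-in for the image array: 100 images of 28 x 28 pixels
# (the tutorial's array is 60000 x 28 x 28).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

# Flatten each image into one row of 784 pixel values.
flat = images.reshape(len(images), -1)

# Scale pixel values to [0, 1] (an assumed pre-processing choice).
flat = flat.astype("float32") / 255.0

print(flat.shape)
```

Passing `-1` lets NumPy infer the second dimension (28 * 28 = 784), so the same line works for other image sizes.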

**Same steps, but on our dataset!**

In [None]:
#Loading required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [None]:
# Loading the matrix
X = np.memmap('/data/bioprotean/SVD/vg/matx_vg_scaled.mymemmap', dtype='float32', mode='r', shape=(159326,2941))

# Make an instance of the model.
# A higher explained-variance target preserves more of the original information,
# but it also keeps more dimensions.
variance = 0.90
pca = PCA(n_components=variance)

In [None]:
#fit the data according to our PCA instance
pca.fit(X)
print("Number of features before PCA = " + str(X.shape[1]))
print("Number of components after PCA (" + str(variance) + " variance) = " + str(pca.n_components_))
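
When `n_components` is a fraction between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction. The sketch below reproduces that choice by hand on synthetic stand-in data (the array `data` and its shape are assumptions for illustration, not our matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 500 samples, 20 features with decaying spread.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.5, 20)

target = 0.90
pca_auto = PCA(n_components=target, svd_solver="full").fit(data)

# Reproduce the choice by hand from a full decomposition.
pca_full = PCA(svd_solver="full").fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k_manual = int(np.searchsorted(cumulative, target, side="right")) + 1

print("chosen by PCA:", pca_auto.n_components_)
print("chosen by hand:", k_manual)
print("variance retained:", pca_auto.explained_variance_ratio_.sum())
```

Plotting `cumulative` against the component index is a quick way to see how many extra dimensions a higher variance target would cost.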

In [None]:
Clus_dataSet = pca.transform(X)
print('Dimension of our data after PCA = '+ str(Clus_dataSet.shape))

**K-means Clustering**

* init: Initialization method for the centroids. The value is 'k-means++', which selects the initial cluster centers for k-means clustering in a smart way to speed up convergence.

* n_clusters: The number of clusters to form, which is also the number of centroids to generate. The code below passes 8, whereas 10 would match the 10 classes according to INDEX; either might not be the best choice, but it is good enough for our context.

* n_init: The number of times the k-means algorithm is run with different centroid seeds. The final result is the best of the n_init consecutive runs in terms of inertia. The code below passes 50, above the 35 suggested by our inertia results (that value might not be the best either, but it is good enough for our context).
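
The inertia mentioned above can be compared across candidate values of `n_clusters`. This is a minimal sketch on synthetic data from `make_blobs`, not on our matrix; the range of `k` and the fixed `random_state` are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data drawn around 4 centers.
points, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.8, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(init="k-means++", n_clusters=k, n_init=10, random_state=0)
    model.fit(points)
    # Inertia is the sum of squared distances from each point to its nearest centroid.
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia generally keeps falling as `k` grows, so the usual heuristic is to look for the point where the decrease flattens out rather than for the minimum.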

In [None]:
k_means = KMeans(init = 'k-means++', n_clusters = 8, n_init = 50)
k_means.fit(Clus_dataSet)
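
After fitting, a quick sanity check is to look at how many points landed in each cluster. The sketch below shows the idea on synthetic stand-in data; in this notebook the same calls would presumably apply to `k_means` after the fit above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 300 two-dimensional points around 3 centers.
points, _ = make_blobs(n_samples=300, centers=3, random_state=0)
model = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=0).fit(points)

# Count the points assigned to each cluster label.
sizes = np.bincount(model.labels_, minlength=model.n_clusters)

print("points per cluster:", sizes)
print("inertia:", model.inertia_)
print("centroid array shape:", model.cluster_centers_.shape)
```

Very small or empty clusters can be a hint that `n_clusters` is set higher than the data supports.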

In [None]:
# 3D visualization of clusters
# Install these if you haven't already.
# %pip installs into the environment of the running kernel.
%pip install --user chart_studio plotly

import plotly as py
import plotly.graph_objs as go
import plotly.express as px

In [None]:
!jupyter labextension list

In [None]:
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!conda install -y -c conda-forge nodejs

In [None]:
!jupyter labextension install jupyterlab-plotly --debug

In [None]:
# 3D Plotly visualization of the clusters using graph objects
# Enable offline notebook mode
py.offline.init_notebook_mode()
# A 3D scatter reads its axis settings from layout.scene, not layout.xaxis/yaxis
layout = go.Layout(
    title='<b>Cluster Visualisation</b>',
    scene=dict(
        xaxis=dict(title='<i>X</i>'),
        yaxis=dict(title='<i>Y</i>'),
        zaxis=dict(title='<i>Z</i>')
    )
)
colors = ['red', 'green', 'blue', 'purple', 'magenta', 'yellow', 'cyan', 'maroon', 'teal', 'black']
traces = []
# Build one trace per fitted cluster so the loop always matches n_clusters
for i in range(k_means.n_clusters):
    my_members = (k_means.labels_ == i)
    index = np.flatnonzero(my_members)  # row indices of this cluster's points, shown on hover
    traces.append(go.Scatter3d(
        x=Clus_dataSet[my_members, 0],  # columns 0, 1 and 2 are the first three
        y=Clus_dataSet[my_members, 1],  # principal components; change the indices
        z=Clus_dataSet[my_members, 2],  # to view other components
        mode='markers',
        marker=dict(size=2, color=colors[i % len(colors)]),
        hovertext=index,
        name='Cluster' + str(i),
    ))
fig = go.Figure(data=traces, layout=layout)

py.offline.iplot(fig)
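
If the interactive figure is slow or the JupyterLab extension is unavailable, a static matplotlib plot of the first two components is a lightweight alternative. The helper `plot_clusters_2d` below is a hypothetical sketch, and the random arrays stand in for `Clus_dataSet` and `k_means.labels_`:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters_2d(points, labels, max_points=20000, seed=0):
    """Scatter the first two columns of `points`, coloured by cluster label."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Subsample large datasets so the figure stays quick to draw.
    keep = rng.choice(n, size=min(n, max_points), replace=False)
    fig, ax = plt.subplots(figsize=(6, 5))
    scatter = ax.scatter(points[keep, 0], points[keep, 1],
                         c=labels[keep], cmap="tab10", s=2)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    fig.colorbar(scatter, ax=ax, label="cluster")
    return fig, ax

# Hypothetical stand-in data; in the notebook these would be
# Clus_dataSet and k_means.labels_.
rng = np.random.default_rng(1)
demo_points = rng.normal(size=(1000, 3))
demo_labels = rng.integers(0, 8, size=1000)
fig, ax = plot_clusters_2d(demo_points, demo_labels)
```

The subsampling cap is an arbitrary default; with roughly 160,000 rows it keeps rendering fast while still showing the overall cluster shapes.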