### Google Colab
Run this cell only when working on the Google Colab platform.

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://github.com/Nak007/varclus">
    <img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

In [None]:
# Mount Google Drive.
from google.colab import drive
drive.mount('/content/drive')

# Clone the `varclus` repository.
!git clone 'http://github.com/Nak007/varclus.git'

# Install `factor_analyzer`
!pip install factor_analyzer

## Example

In [1]:
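import sys
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
# Make the cloned repository importable (needed on Colab).
sys.path.append('/content/varclus')
from varclus import *
# Display floats with four decimal places.
pd.options.display.float_format = '{:,.4f}'.format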

In [2]:
# Load the breast cancer dataset as a DataFrame with named columns.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

The `maxclus` parameter caps the number of clusters: no more than `maxclus` clusters are computed. By default, `VariableClustering` keeps splitting and optimizing clusters until every cluster has a second eigenvalue less than `maxeigval2`.

In [3]:
vc = VariableClustering(maxclus=10, maxeigval2=0.8).fit(X)
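
To illustrate the stopping rule, the sketch below computes the second-largest eigenvalue of a correlation matrix for synthetic data. It is not the library's implementation: `second_eigenvalue` is a hypothetical helper, and the data are generated only to show that a homogeneous group of variables falls below the threshold while a mix of two unrelated groups does not.

```python
import numpy as np

def second_eigenvalue(data):
    """Second-largest eigenvalue of the correlation matrix of `data` (columns = variables)."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigvals[-2]

rng = np.random.default_rng(0)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)  # two independent latent factors

# Three noisy copies of each factor form two internally coherent groups.
group_a = np.column_stack([f1 + 0.3 * rng.normal(size=n) for _ in range(3)])
group_b = np.column_stack([f2 + 0.3 * rng.normal(size=n) for _ in range(3)])
mixed = np.hstack([group_a, group_b])

maxeigval2 = 0.8  # same threshold as in the cell above
print("group_a:", round(second_eigenvalue(group_a), 4))
print("mixed:  ", round(second_eigenvalue(mixed), 4))
```

A single coherent group has a small second eigenvalue, so it would not be split further; the mixed set has a large one, which signals that a second dimension is present and a split is warranted.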

For each cluster, `vc.info` displays the following:
- the cluster number (**Cluster**),
- the number of variables in the cluster (**N_Vars**),
- the variation explained by the cluster component, i.e. the first eigenvalue (**Eigval1**),
- the second eigenvalue (**Eigval2**), and
- the proportion of the total variance explained by the cluster component (**VarProp**).

The sum of **VarProp** gives the share of the total variation in the data that is accounted for by all cluster components together.

In [20]:
vc.info

| Cluster | N_Vars | Eigval1 | Eigval2 | VarProp |
|---|---|---|---|---|
| 1 | 8 | 7.2816 | 0.5022 | 0.2427 |
| 2 | 5 | 4.1322 | 0.5428 | 0.1377 |
| 3 | 5 | 3.6297 | 0.7629 | 0.121 |
| 4 | 2 | 1.912 | 0.088 | 0.0637 |
| 5 | 3 | 1.815 | 0.6028 | 0.0605 |
| 6 | 3 | 2.9083 | 0.0663 | 0.0969 |
| 7 | 2 | 1.8053 | 0.1947 | 0.0602 |
| 8 | 2 | 1.6998 | 0.3002 | 0.0567 |
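
As a quick check of how the columns relate, the sketch below recomputes **VarProp** from the values shown above. It assumes that **VarProp** equals **Eigval1** divided by the total number of variables (30 here); this is inferred from the reported numbers, not taken from the library's source.

```python
import numpy as np

# Values copied from the `vc.info` output above.
n_vars = np.array([8, 5, 5, 2, 3, 3, 2, 2])
eigval1 = np.array([7.2816, 4.1322, 3.6297, 1.912, 1.815, 2.9083, 1.8053, 1.6998])

# Assumption: each standardized variable contributes one unit of variance,
# so the total variance equals the total number of variables.
total_variance = n_vars.sum()
var_prop = eigval1 / total_variance

print(np.round(var_prop, 4))                    # per-cluster proportions
print("total explained:", var_prop.sum())       # roughly 0.84
```

Under this assumption, the eight cluster components together account for roughly 84% of the total variation.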


`vc.r2` shows how the variables are clustered. For each variable, it displays the R-square value with its own cluster (**RS_Own**) and the R-square value with its nearest cluster (**RS_NC**). If the clusters are well separated, the R-square value with the nearest cluster should be low. The last column (**RS_Ratio**) displays the ratio $(1-R^{2}_{own})/(1-R^{2}_{nearest})$ for each variable. Small values of this ratio indicate good clustering.

In [19]:
vc.r2.head(10)

| Cluster | Variable | RS_Own | RS_NC | RS_Ratio |
|---|---|---|---|---|
| 1 | worst perimeter | 0.9795 | 0.5556 | 0.0462 |
| 1 | worst radius | 0.9702 | 0.5394 | 0.0648 |
| 1 | mean perimeter | 0.9679 | 0.5198 | 0.0669 |
| 1 | mean radius | 0.9535 | 0.5 | 0.093 |
| 1 | mean area | 0.947 | 0.5848 | 0.1277 |
| 1 | worst area | 0.936 | 0.6028 | 0.1611 |
| 1 | mean concave points | 0.8118 | 0.6179 | 0.4926 |
| 1 | worst concave points | 0.7158 | 0.7255 | 1.0352 |
| 2 | worst compactness | 0.908 | 0.348 | 0.1411 |
| 2 | worst concavity | 0.8934 | 0.4118 | 0.1813 |
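
The ratio can be recomputed directly from the two R-square columns. The sketch below does this for the ten rows shown above, using values copied from that output; small differences from the reported **RS_Ratio** come from the inputs being rounded to four decimals.

```python
import numpy as np

# Values copied from the `vc.r2.head(10)` output above.
variables = [
    "worst perimeter", "worst radius", "mean perimeter", "mean radius",
    "mean area", "worst area", "mean concave points", "worst concave points",
    "worst compactness", "worst concavity",
]
rs_own = np.array([0.9795, 0.9702, 0.9679, 0.9535, 0.947,
                   0.936, 0.8118, 0.7158, 0.908, 0.8934])
rs_nc = np.array([0.5556, 0.5394, 0.5198, 0.5, 0.5848,
                  0.6028, 0.6179, 0.7255, 0.348, 0.4118])

# (1 - R^2_own) / (1 - R^2_nearest)
rs_ratio = (1 - rs_own) / (1 - rs_nc)

for name, ratio in zip(variables, rs_ratio):
    # A ratio above 1 means the variable correlates more with its nearest cluster.
    flag = "  <- closer to its nearest cluster" if ratio > 1 else ""
    print(f"{name:<22}{ratio:.4f}{flag}")
```

A ratio above 1 occurs exactly when `RS_NC` exceeds `RS_Own`, as for `worst concave points`; such a variable sits on the boundary between two clusters.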


`vc.labels_` shows the cluster that each variable belongs to at each split ($k$). Each column corresponds to a layer index, i.e. $k-1$.

In [21]:
vc.labels_.sample(5)

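| Variable | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| smoothness error | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| symmetry error | 2 | 3 | 3 | 5 | 5 | 5 | 5 |
| mean area | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| worst smoothness | 2 | 2 | 2 | 2 | 2 | 7 | 7 |
| worst perimeter | 1 | 1 | 1 | 1 | 1 | 1 | 1 |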
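
One way to read this output is to look for the layers at which a variable's label changes, since a change means the variable was moved to a newly created cluster. The sketch below does this for the five sampled rows above; the values are copied from that output, and the name `moved_at` is only illustrative.

```python
import numpy as np

# Rows copied from the `vc.labels_.sample(5)` output above (columns = layers 0..6).
variables = ["smoothness error", "symmetry error", "mean area",
             "worst smoothness", "worst perimeter"]
labels = np.array([
    [2, 3, 3, 5, 5, 5, 5],
    [2, 3, 3, 5, 5, 5, 5],
    [1, 1, 1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2, 7, 7],
    [1, 1, 1, 1, 1, 1, 1],
])

# True where the label at a layer differs from the label at the previous layer.
changed = labels[:, 1:] != labels[:, :-1]

# Map each variable to the list of layers at which it changed cluster.
moved_at = {
    name: (np.flatnonzero(row) + 1).tolist()
    for name, row in zip(variables, changed)
}
for name, layers in moved_at.items():
    print(f"{name:<18}{layers}")
```

Variables with an empty list, such as `mean area`, stayed in the same cluster through every split shown.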
