# Experiment Control: Full

<a id="contents"></a>
## Contents

1. <a href="#section1">Set Universal Experiment Parameters</a>
2. <a href="#section2">Pre-train Deep Clustering Model</a>
3. <a href="#section3">Train Deep Clustering Model</a>
4. <a href="#section4">Cluster Entire Dataset</a>
5. <a href="#section5">Evaluate Optimal Number of Clusters</a>

In [1]:
import importlib as imp
import os
import sys
sys.path.insert(0, '../RISCluster/')

from IPython.display import Markdown as md
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
if sys.platform == 'linux':
    from cuml import TSNE  # GPU implementation (RAPIDS cuML)
else:
    from sklearn.manifold import TSNE  # CPU fallback, e.g. on macOS
import torch
from torch.utils.data import DataLoader, Subset
from torchsummary import summary
from torchvision import transforms
from tqdm import tqdm

import models
from networks import AEC, DCM
import plotting
import utils

<a id="section1"></a>
## 1. Set Universal Experiment Parameters

In [2]:
exp_name = 'Full'
fname_dataset = f'../../../Data/{exp_name}/{exp_name}.h5'
%run QueryDB $fname_dataset

# Generate new sample index for data set?
genflag = False

 >> h5 dataset contains 427798 samples with dimensions [88,101]. <<


In [None]:
if genflag:
    M = 125000
    savepath = f'../../../Data/{exp_name}'
    %run GenerateSampleIndex.py $M $fname_dataset $savepath
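
The index file name used later in this notebook (`TraValIndex_M=125000.pkl`) suggests that `GenerateSampleIndex.py` draws `M` samples from the dataset and splits them into training and validation indices. A minimal sketch of that idea follows. The 80/20 split, the seed, and the helper name `make_traval_index` are assumptions for illustration, not the script's actual behavior.

```python
import random

def make_traval_index(n_total, m, frac_train=0.8, seed=0):
    """Draw m unique sample indices from range(n_total) and split them.

    Hypothetical helper: the split fraction and seed are illustrative only.
    """
    if m > n_total:
        raise ValueError("m cannot exceed the number of samples")
    rng = random.Random(seed)
    index = rng.sample(range(n_total), m)  # sampling without replacement
    n_train = int(frac_train * m)
    index_tra = sorted(index[:n_train])
    index_val = sorted(index[n_train:])
    return index_tra, index_val

# 427798 is the sample count reported by QueryDB above.
index_tra, index_val = make_traval_index(427798, 125000)
print(len(index_tra), len(index_val))
```

Sorting each index list is a design choice that may help when reading from HDF5, where ordered access tends to be faster than random access.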

In [3]:
universal = {
    'exp_name': exp_name,
    'fname_dataset': fname_dataset,
    'savepath': f'../../../Experiments/{exp_name}',
    'indexpath': f'../../../Data/{exp_name}/TraValIndex_M=125000.pkl'
}
device = utils.set_device()
transform = 'sample_norm_cent'
dataset = utils.H5SeismicDataset(
    fname_dataset,
    transform=transforms.Compose(
        [utils.SpecgramShaper(), utils.SpecgramToTensor()]
    )
)

CUDA device available, using GPU.


<a href="#contents">Return to Top</a>
<a id="section2"></a>
## 2. Pre-train Deep Clustering Model

### 2.1 Autoencoder Architecture

In [None]:
model_display = AEC().to(device)
summary(model_display, (1, 87, 100))

### 2.2 Configure Pre-training

In [None]:
# Image Sample Indexes for Example Waveforms:
img_index = '403049, 334383, 300610, 381290'

parameters = {
    'mode': 'pretrain',
    'n_epochs': 500,
    'show': False,
    'send_message': True,
    'early_stopping': True,
    'patience': 10,
    'transform': 'sample_norm_cent',
    'img_index': img_index,
    'km_metrics': False,
    'klist': '2, 20',
    'tb': True,
    'tbport': 6999,
    'workers': 16
}
hyperparameters = {
    'batch_size': '256, 512, 1024',
    'lr': '0.0001, 0.001'
}
init_path = utils.config_training(universal, parameters, hyperparameters)
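
The hyperparameters above are given as comma-separated strings, and the run names in Section 2.5 (`Run_BatchSz=..._LR=...`) suggest that each combination becomes its own run. How `utils.config_training` and `runDCM.py` actually expand these strings is not shown here, so the following is only a sketch of one plausible expansion; `expand_grid` is a hypothetical helper.

```python
from itertools import product

def expand_grid(hyperparameters):
    """Expand comma-separated strings into one dict per combination."""
    keys = list(hyperparameters)
    values = [
        [v.strip() for v in str(hyperparameters[k]).split(',')]
        for k in keys
    ]
    return [dict(zip(keys, combo)) for combo in product(*values)]

hyperparameters = {
    'batch_size': '256, 512, 1024',
    'lr': '0.0001, 0.001'
}
runs = expand_grid(hyperparameters)
for run in runs:
    # Run name format copied from Section 2.5 of this notebook.
    print(f"Run_BatchSz={run['batch_size']}_LR={run['lr']}")
```

Under this assumption, three batch sizes and two learning rates yield six pre-training runs.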

### 2.3 View Detection Examples

In [None]:
imp.reload(plotting)

img_index_ = list(map(int, img_index.split(sep=',')))
fig = plotting.view_detections(fname_dataset, img_index_)
# fig.savefig(f"{universal['savepath']}/DetectExamples_{exp_name}.png", dpi=300)

### 2.4 Pre-train Autoencoder

In [None]:
md(f"To run the pre-training script, run the following code in the terminal with the proper ```conda``` environment activated:<br>```python runDCM.py {init_path}```")

<a id="BestAEC"></a>
### 2.5 Select Best Pre-training Run

Hyperparameter tuning selected the following run as the best pre-trained model:

In [None]:
batch_size = 256
LR = 0.001

expserial = 'Exp20201220T091211'
runserial = f'Run_BatchSz={batch_size}_LR={LR}'
exp_path = f"../../../Experiments/{exp_name}/Models/AEC/{expserial}/{runserial}"

AEC_weights = f"{exp_path}/AEC_Params_Final.pt"

Return to <a href="#ConfigDCM">Section 3.2</a>.<br>
Return to <a href="#section5">Section 5</a>.

### 2.6 View Autoencoder Performance

In [None]:
path = f"{exp_path}/snapshots"
imgfile = f"{path}/{sorted(os.listdir(path))[-1]}"
img = matplotlib.image.imread(imgfile)
plt.figure(figsize=(12,6))
plt.imshow(img)
plt.show()
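
The cell above takes the last entry of a lexicographically sorted directory listing. That is the newest snapshot only if the epoch numbers in the file names are zero-padded. The sketch below shows the difference using a numeric-aware sort key; the file names are hypothetical, since the actual snapshot naming scheme is not shown in this notebook.

```python
import re

def natural_key(name):
    """Split a file name into text and integer chunks for numeric-aware sorting."""
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r'(\d+)', name)]

# Hypothetical snapshot names; the real naming scheme may differ.
files = ['AEC_Epoch_9.png', 'AEC_Epoch_10.png', 'AEC_Epoch_100.png']

latest_lexicographic = sorted(files)[-1]
latest_natural = sorted(files, key=natural_key)[-1]
print(latest_lexicographic, latest_natural)
```

If the snapshots are already zero-padded, both approaches return the same file and the plain `sorted` call is sufficient.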

<a href="#contents">Return to Top</a>
<a id="section3"></a>
## 3. Train Deep Clustering Model

### 3.1 DCM Architecture

In [None]:
model_display = DCM(n_clusters=5).to(device)
# Input shape matches the AEC summary in Section 2.1.
summary(model_display, (1, 87, 100))

<a id="ConfigDCM"></a>
### 3.2 Configure Training
Run <a href="#BestAEC">Section 2.5</a> first to define the path to the AEC weights (`AEC_weights`).

In [None]:
parameters = {
    'mode': 'train',
    'n_epochs': 400,
    'update_interval': -1,
    'show': False,
    'send_message': True,
    'saved_weights': AEC_weights,
    'transform': 'sample_norm_cent',
    'tb': True,
    'tbport': 6999,
    'workers': 16,
    'init': 'gmm'
}
hyperparameters = {
    'n_clusters': '6, 7, 8, 9',
    'batch_size': '1024',
    'lr': '0.001',
    'gamma': '0.05',
    'tol': '0.002'
}
init_path = utils.config_training(universal, parameters, hyperparameters)

### 3.3 Train Model

In [None]:
md(f"To run the training script, run the following code in the terminal with the proper ```conda``` environment activated:<br>```python runDCM.py {init_path}```")

<a id="BestDCM"></a>
### 3.4 Select Best DCM Run
Hyperparameter tuning selected the following run as the best model:

In [4]:
n_clusters = 8
batch_size = 1024
LR = 0.001

expserial = 'Exp20201226T120534'
runserial = f'Run_Clusters={n_clusters}_BatchSz={batch_size}_LR={LR}_gamma=0.05_tol=0.002'
exp_path = f"../../../Experiments/{exp_name}/Models/DCM/{expserial}/{runserial}"
DCM_weights = f"{exp_path}/DCM_Params_Final.pt"

Return to <a href="#section4">Section 4</a>.

### 3.5 Evaluate Training

#### 3.5.1 Load Data and Model Parameters

In [None]:
index_tra, _ = utils.load_TraVal_index(fname_dataset, universal['indexpath'])
tra_dataset = Subset(dataset, index_tra)
dataloader = DataLoader(tra_dataset, batch_size=1024, num_workers=16)

DCM_weights1 = f"{exp_path}/DCM_Params_Initial.pt"
DCM_weights2 = DCM_weights

model1 = DCM(n_clusters).to(device)
model1 = utils.load_weights(model1, DCM_weights1, device)
model2 = DCM(n_clusters).to(device)
model2 = utils.load_weights(model2, DCM_weights2, device)

centroids1 = model1.clustering.weights.detach().cpu().numpy()
centroids2 = model2.clustering.weights.detach().cpu().numpy()

data1 = models.infer_z(dataloader, model1, device, v=True)
data2 = models.infer_z(dataloader, model2, device, v=True)

_, labels1 = models.infer_labels(dataloader, model1, device)
_, labels2 = models.infer_labels(dataloader, model2, device)

#### 3.5.2 View Results

In [None]:
imp.reload(plotting)
p = 2
fig = plotting.cluster_gallery(
    model2,
    dataloader.dataset,
    fname_dataset,
    index_tra,
    device,
    data2,
    labels2,
    centroids2,
    p,
    True,
    True
)

In [None]:
fig.savefig('gallery.png', dpi=300)

#### 3.5.3 View t-SNE Results

In [None]:
M = len(data1)
# Shared t-SNE settings; perplexity and learning rate scale with sample count.
tsne_kwargs = dict(
    n_components=2, perplexity=int(M/50), early_exaggeration=2000,
    learning_rate=int(M/25), n_iter=2500, verbose=0, random_state=2009
)
results1 = TSNE(**tsne_kwargs).fit_transform(data1.astype('float64'))
results2 = TSNE(**tsne_kwargs).fit_transform(data2.astype('float64'))
fig1 = plotting.view_TSNE(results1, labels1, 't-SNE: Epoch 0', True)
fig2 = plotting.view_TSNE(results2, labels2, 't-SNE: Epoch 105', True)

In [None]:
fig1.savefig('tSNE_i.png', dpi=300)
fig2.savefig('tSNE_f.png', dpi=300)

#### 3.5.4 View DCM Dashboard

In [None]:
imp.reload(plotting)
p = 2
fig = plotting.centroid_dashboard(
    data2,
    labels2,
    centroids2,
    n_clusters,
    p,
    True
)

#### 3.5.5 View Centroid Distance Matrix

In [None]:
p = 2
fig = plotting.centroid_distances(
    data2,
    labels2,
    centroids2,
    n_clusters,
    p,
    True
)
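
The internals of `plotting.centroid_distances` are not shown in this notebook. Assuming `p` is the order of the norm used to measure distance in the latent space, the underlying matrix could be computed roughly as follows; `centroid_distance_matrix` is a hypothetical helper and the centroids are toy values.

```python
import numpy as np

def centroid_distance_matrix(centroids, p=2):
    """Return the (K, K) matrix of pairwise p-norm distances between centroids."""
    # Broadcasting yields a (K, K, D) array of pairwise differences.
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, ord=p, axis=-1)

# Toy centroids in a 2-D latent space.
centroids = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = centroid_distance_matrix(centroids, p=2)
print(D)
```

The matrix is symmetric with a zero diagonal, so only the upper triangle carries information.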

#### 3.5.6 View Latent Space

In [None]:
imp.reload(plotting)
p = 2
fig = plotting.view_latent_space(
    data1,
    data2,
    labels1,
    labels2,
    centroids1,
    centroids2,
    n_clusters,
    p,
    True,
    True
)

In [None]:
fig.savefig('zspace.png', dpi=300)

#### 3.5.7 View CDF

In [None]:
imp.reload(plotting)
p = 2
fig = plotting.view_class_cdf(
    data1,
    data2,
    labels1,
    labels2,
    centroids1,
    centroids2,
    n_clusters,
    p,
    True,
    True
)
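
As a rough illustration of what a per-class CDF could represent, the sketch below computes the empirical CDF of distances from each cluster member to its assigned centroid. This is an assumption about the quantity being plotted, not a description of `plotting.view_class_cdf`; the helper `class_cdf` and the synthetic data are hypothetical.

```python
import numpy as np

def class_cdf(z, labels, centroids, label, p=2):
    """Empirical CDF of p-norm distances from one cluster's members to its centroid."""
    members = z[labels == label]
    d = np.sort(np.linalg.norm(members - centroids[label], ord=p, axis=1))
    cdf = np.arange(1, len(d) + 1) / len(d)
    return d, cdf

# Synthetic latent vectors scattered around two toy centroids.
rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = np.repeat([0, 1], 50)
z = centroids[labels] + rng.normal(scale=0.5, size=(100, 2))

d, cdf = class_cdf(z, labels, centroids, label=0)
print(d[:3], cdf[:3])
```

A CDF that rises steeply at small distances indicates a compact cluster, which makes it convenient to compare the initial and final models side by side.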

In [None]:
fig.savefig('CDF.png', dpi=300)

#### 3.5.8 View PDF

In [None]:
p = 2
fig = plotting.view_class_pdf(
    data1,
    data2,
    labels1,
    labels2,
    centroids1,
    centroids2,
    n_clusters,
    p,
    True,
    True
)

In [None]:
fig.savefig('PDF.png', dpi=300)

<a href="#contents">Return to Top</a>
<a id="section4"></a>
## 4. Cluster Entire Dataset
Run <a href="#BestDCM">Section 3.4</a> first to define the path to the DCM weights (`DCM_weights`).

In [5]:
parameters = {
    'mode': 'predict',
    'send_message': False,
    'saved_weights': DCM_weights,
    'transform': 'sample_norm_cent',
    'workers': 16,
    'tb': False
}
init_path = utils.config_training(universal, parameters)

In [6]:
md(f"To run the prediction script, run the following code in the terminal with the proper ```conda``` environment activated:<br>```python runDCM.py {init_path}```")

To run the prediction script, run the following code in the terminal with the proper ```conda``` environment activated:<br>```python runDCM.py ../../../Experiments/Full/init_predict.ini```
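
The output above shows that `utils.config_training` writes an `.ini` file that `runDCM.py` then reads. The layout of that file is not shown, so the following is only a sketch of how such dictionaries could be round-tripped with the standard-library `configparser`; the section names `UNIVERSAL` and `PARAMETERS` are hypothetical.

```python
import configparser
import io

universal = {'exp_name': 'Full', 'savepath': '../../../Experiments/Full'}
parameters = {'mode': 'predict', 'send_message': False, 'workers': 16, 'tb': False}

config = configparser.ConfigParser()
# Section names are hypothetical; configparser stores every value as a string.
config['UNIVERSAL'] = {k: str(v) for k, v in universal.items()}
config['PARAMETERS'] = {k: str(v) for k, v in parameters.items()}

# Write to an in-memory buffer instead of a file to keep the sketch side-effect free.
buffer = io.StringIO()
config.write(buffer)

loaded = configparser.ConfigParser()
loaded.read_string(buffer.getvalue())
workers = loaded['PARAMETERS'].getint('workers')
send_message = loaded['PARAMETERS'].getboolean('send_message')
print(workers, send_message)
```

Because every value is stored as text, the reading side has to convert types explicitly, which is why typed getters such as `getint` and `getboolean` are used here.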

<a href="#contents">Return to Top</a>
<a id="section5"></a>
## 5. Evaluate Optimal Number of Clusters

### 5.1 Load Data
Run <a href="#BestAEC">Section 2.5</a> first to define the path to the AEC weights (`AEC_weights`).

In [None]:
index_tra, _ = utils.load_TraVal_index(fname_dataset, universal['indexpath'])

tra_dataset = Subset(dataset, index_tra)
dataloader = DataLoader(tra_dataset, batch_size=512, num_workers=16)

model = AEC().to(device)
model = utils.load_weights(model, AEC_weights, device)

### 5.2 Compute K-means Metrics

In [None]:
# klist = parameters['klist']
klist = '2, 20'
klist = np.arange(int(klist.split(',')[0]), int(klist.split(',')[1])+1)
inertia, silh, gap_g, gap_u = models.kmeans_metrics(dataloader, model, device, klist)
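
`models.kmeans_metrics` is not shown in this notebook. For orientation, the sketch below computes two of the returned quantities, inertia and silhouette score, over a range of `k` with scikit-learn on synthetic data. The gap statistics are omitted, and the toy data, `n_init`, and `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the latent features: three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0
)

klist = np.arange(2, 7)
inertia, silh = [], []
for k in klist:
    km = KMeans(n_clusters=int(k), n_init=10, random_state=0).fit(X)
    inertia.append(km.inertia_)  # within-cluster sum of squared distances
    silh.append(silhouette_score(X, km.labels_))

best_k = int(klist[int(np.argmax(silh))])
print(best_k)
```

Inertia generally keeps decreasing as `k` grows, so it is usually read for an elbow, whereas the silhouette score can be compared directly across values of `k`.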

### 5.3 Plot Metrics

In [None]:
fig = plotting.view_cluster_stats(klist, inertia, silh, gap_g, gap_u, show=True)
np.save('kmeans_inertia', inertia)
# savepath = 'test.png'
# fig.savefig(savepath)