# __Step 3.2: Clustering based on Tf-idf__

The goals for step 3.2 are to:
- Cluster documents so that each cluster represents a subfield of plant sciences

## ___Set up___

### Module import

```bash
conda install -c conda-forge umap-learn 
conda install -c conda-forge hdbscan
```

In [1]:
import sys, os, pickle
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
from tqdm import tqdm
from datetime import datetime
from scipy.sparse import csr_matrix

# Specify subspace-clustering module location
# selfrepresentation.py was modified to use tqdm instead of progressbar
module_selfrep = Path.home() / 'github/subspace-clustering/cluster'
sys.path.append(str(module_selfrep))
from selfrepresentation import ElasticNetSubspaceClustering

import umap
import hdbscan
from sklearn.manifold import TSNE
from sklearn.cluster import SpectralClustering, KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler
from sklearn.decomposition import TruncatedSVD
from keras.layers import Dense, Input
from keras.models import Model
from keras.models import load_model
from keras.callbacks import ModelCheckpoint

### Key variables

In [2]:
# Reproducibility
seed = 20220609

# Setting working directory
proj_dir   = Path.home() / "projects/plant_sci_hist"
work_dir   = proj_dir / "3_key_term_temporal/3_2_tf_idf_clustering"
work_dir.mkdir(parents=True, exist_ok=True)

os.chdir(work_dir)

# science dictionary vocab directory
sci_dict_dir = proj_dir / "_vocab"

# specify plant science corpus
sparse_matrix_dir  = proj_dir / "3_key_term_temporal/3_1_pubmed_vocab"
sparse_matrix_file = sparse_matrix_dir / "tfidf_sparse_matrix_4542"
feat_name_file     = sparse_matrix_dir / "tfidf_feat_name_and_sum_4542"

### Read sparse matrix and feature names

In [3]:
# Load sparse matrix from a pickle
with open(sparse_matrix_file, 'rb') as f:
  X_vec = pickle.load(f)

# Load feature names and tf-idf sum
feat_sum = pd.read_csv(feat_name_file, sep='\t')

X_vec.shape, feat_sum.shape

((421658, 4542), (4542, 3))

### Normalization

[This post](https://towardsdatascience.com/understand-data-normalization-in-machine-learning-8ff3062101f0) and [this one](https://iq.opengenus.org/normalization-in-detail/) are helpful.

In [4]:
np.min(X_vec[:,1]), np.max(X_vec[:,1])

(0.0, 0.9355844320660135)

In [5]:
# Standardize each feature, then reduce the memory requirement by using a
# smaller dtype. Could not get the integer conversion to work, so the values
# are stored as float32, which is at least smaller.
X_vec_norm = StandardScaler(with_mean=False).fit_transform(X_vec).astype('float32')
type(X_vec_norm), X_vec_norm.shape

(scipy.sparse._csr.csr_matrix, (421658, 4542))

In [6]:
np.min(X_vec_norm[:,1]), np.max(X_vec_norm[:,1])

(0.0, 53.0883)
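
To illustrate why `with_mean=False` is used above, here is a minimal sketch on toy data (not the project matrix). Centering would turn zeros into non-zeros and make the matrix dense, so only the per-feature scale is adjusted and zero entries stay zero. The names `X_toy` and `X_toy_scaled` are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Toy tf-idf-like matrix: 4 documents x 3 features, mostly zeros
X_toy = csr_matrix(np.array([[0.0, 0.5, 0.0],
                             [0.2, 0.0, 0.0],
                             [0.0, 0.9, 0.1],
                             [0.0, 0.0, 0.0]]))

# with_mean=False: divide each column by its standard deviation without
# centering, so zero entries stay zero and the result remains sparse
scaler = StandardScaler(with_mean=False)
X_toy_scaled = scaler.fit_transform(X_toy).astype('float32')

print(type(X_toy_scaled).__name__, X_toy_scaled.dtype)
print("non-zeros before/after:", X_toy.nnz, X_toy_scaled.nnz)
```

The number of stored non-zero entries is unchanged by the scaling, which is what keeps the memory footprint manageable for the full matrix.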

### Dimensionality reduction

Check out:
- [What, Why and How of t-SNE](https://towardsdatascience.com/what-why-and-how-of-t-sne-1f78d13e224d)
- [An Introduction to t-SNE with Python Example](https://towardsdatascience.com/an-introduction-to-t-sne-with-python-example-5a3a293108d1)
- [TSNE Visualization Example in Python ](https://www.datatechnotes.com/2020/11/tsne-visualization-example-in-python.html)
- [Introduction to t-SNE in Python with scikit-learn](https://danielmuellerkomorowska.com/2021/01/05/introduction-to-t-sne-in-python-with-scikit-learn/)

t-SNE cannot be run locally on the full matrix. Options considered:
- Move to HPC.
- Follow [this advice](https://stackoverflow.com/questions/32105302/dimension-reduction-with-t-sne): do PCA first, then t-SNE. PCA does not support sparse input, so use TruncatedSVD instead.
  - This still needs a lot of memory, for reasons that are not yet clear.
- Skip t-SNE and use the PCA (TruncatedSVD) matrix for clustering in the next step.
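
The reduction step can be sketched on toy data as follows. This is an illustration under assumptions, not the project run: the random sparse matrix, its size, and `n_components=5` are made up. `TruncatedSVD` accepts sparse input directly and returns a dense array, and `explained_variance_ratio_` gives a rough sense of how much variance the retained components keep.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for the tf-idf matrix: 200 docs x 50 terms, 5% non-zero
X_toy = sparse_random(200, 50, density=0.05, format='csr', random_state=0)

# Reduce to 5 components; sparse input is accepted without densifying first
svd_toy = TruncatedSVD(n_components=5, n_iter=7, random_state=0)
X_toy_reduced = svd_toy.fit_transform(X_toy)  # dense ndarray, shape (200, 5)

print(X_toy_reduced.shape, type(X_toy_reduced).__name__)
print("fraction of variance kept:", svd_toy.explained_variance_ratio_.sum())
```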

In [7]:
svd = TruncatedSVD(n_components=100, n_iter=7, random_state=seed)
svd.fit(X_vec_norm)

In [8]:
X_vec_norm_pca = csr_matrix(svd.transform(X_vec_norm))
X_vec_norm_pca.shape, type(X_vec_norm_pca)

((421658, 100), scipy.sparse._csr.csr_matrix)

In [9]:
with open(work_dir / "X_vec_norm_pca", "wb") as f:
  pickle.dump(X_vec_norm_pca, f)

In [9]:
# Still does not work after dimensionality reduction: t-SNE still asks to
# build a 421658x421658 matrix to store pairwise info.
# Run on HPC using the full matrix instead.

#model_tsne = TSNE(n_components=20, random_state=seed, perplexity=50,
#                  method='exact', n_iter=5000, verbose=2, n_jobs=16,
#                  init='random', learning_rate='auto')
#X_vec_norm_tsne = model_tsne.fit_transform(X_vec_norm_pca)

## ___Clustering___

Check out:
- [How to cluster in High Dimensions](https://towardsdatascience.com/how-to-cluster-in-high-dimensions-4ef693bacc6)

### Kmeans

In [13]:
clus_kmeans = KMeans(n_clusters=500, random_state=seed, max_iter=400, verbose=3)

In [14]:
clus_kmeans.fit(X_vec_norm_pca)

Initialization complete
Iteration 0, inertia 70938928.0.
Iteration 1, inertia 54233964.0.
Iteration 2, inertia 52660980.0.
Iteration 3, inertia 52072628.0.
Iteration 4, inertia 51767040.0.
Iteration 5, inertia 51569404.0.
Iteration 6, inertia 51419312.0.
Iteration 7, inertia 51323668.0.
Iteration 8, inertia 51258548.0.
Iteration 9, inertia 51203852.0.
Iteration 10, inertia 51152516.0.
Iteration 11, inertia 51107504.0.
Iteration 12, inertia 51068444.0.
Iteration 13, inertia 51034852.0.
Iteration 14, inertia 51007640.0.
Iteration 15, inertia 50983460.0.
Iteration 16, inertia 50960728.0.
Iteration 17, inertia 50938448.0.
Iteration 18, inertia 50919500.0.
Iteration 19, inertia 50902460.0.
Iteration 20, inertia 50887404.0.
Iteration 21, inertia 50875092.0.
Iteration 22, inertia 50865476.0.
Iteration 23, inertia 50858040.0.
Iteration 24, inertia 50851720.0.
Iteration 25, inertia 50845236.0.
Iteration 26, inertia 50838092.0.
Iteration 27, inertia 50829612.0.
Iteration 28, inertia 50821368.0.


In [15]:
# Save the k-means fit
with open(work_dir / "clus_kmeans", "wb") as f:
  pickle.dump(clus_kmeans, f)

In [17]:
# Load the k-means fit in case the session died.
with open(work_dir / "clus_kmeans", "rb") as f:
  clus_kmeans_loaded = pickle.load(f)
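
Once a fit is loaded, a quick sanity check is to look at how points are distributed across clusters. The following is a minimal sketch on toy data: the blobs and `n_clusters=3` are made up, and with the real fit `clus_kmeans_loaded.labels_` would take the place of `km_toy.labels_`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points in 5 dimensions drawn around 3 centers
X_toy, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
km_toy = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Number of points assigned to each cluster label
cluster_sizes = np.bincount(km_toy.labels_, minlength=km_toy.n_clusters)
print(cluster_sizes, cluster_sizes.sum())
```

Very small or very large clusters in this count would be a hint that the chosen number of clusters deserves a second look.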

### Spectral

Ran into a memory problem:
- MemoryError: Unable to allocate 1.08 TiB for an array with shape (148010221199,) and data type int64
- Do PCA first, then run again.
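
The size in the error message can be checked with simple arithmetic, since an `int64` entry takes 8 bytes. The same calculation for a dense 421658 x 421658 pairwise matrix, assuming `float64` entries (an assumption about the dtype), gives a rough idea of why pairwise approaches do not fit in memory here.

```python
# Bytes per TiB
TIB = 2 ** 40

# Array from the MemoryError: 148010221199 int64 entries, 8 bytes each
n_entries = 148_010_221_199
error_tib = n_entries * 8 / TIB
print(f"{error_tib:.2f} TiB")

# Dense pairwise matrix over all 421658 documents, assuming float64 entries
n_docs = 421_658
pairwise_tib = n_docs ** 2 * 8 / TIB
print(f"{pairwise_tib:.2f} TiB")
```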

In [18]:
cluster_spectral = SpectralClustering(n_clusters=500,
                                      assign_labels='discretize',
                                      random_state=seed, n_jobs=16, verbose=3)

In [19]:
cluster_spectral.fit(X_vec_norm_pca)

In [None]:
# Save the spectral clustering fit
with open(work_dir / "clus_spectral", "wb") as f:
  pickle.dump(cluster_spectral, f)

In [None]:
# Load the spectral clustering fit in case the session died.
with open(work_dir / "clus_spectral", "rb") as f:
  clus_spectral_loaded = pickle.load(f)

### HDBSCAN

https://ml2021.medium.com/clustering-with-python-hdbscan-964fb8292ace

### Subspace-clustering

https://github.com/ChongYou/subspace-clustering
- Ran into an issue:
  - return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
  - ValueError: zero-size array to reduction operation maximum which has no identity

In [None]:
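# NOTE: X_vec_norm_lil is not defined earlier in this notebook; it is
# presumably a LIL-format copy of X_vec_norm (assumption).
model = ElasticNetSubspaceClustering(n_clusters=1000, random_state=seed,
                                     algorithm='lasso_lars', n_jobs=16,
                                     gamma=50).fit(X_vec_norm_lil)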

## ___Not used___

## ___N2d clustering___

Did not try to figure out how to pass a sparse matrix...

### Set up functions

Based on [this post](https://towardsdatascience.com/deep-clustering-with-sparse-data-b2eb1bf2922e)

In [None]:
def get_autoencoder(dims, act='relu'):
    n_stacks = len(dims) - 1
    x = Input(shape=(dims[0],), name='input')

    h = x
    for i in range(n_stacks - 1):
        h = Dense(dims[i + 1], activation=act, name='encoder_%d' % i)(h)

    h = Dense(dims[-1], name='encoder_%d' % (n_stacks - 1))(h)
    for i in range(n_stacks - 1, 0, -1):
        h = Dense(dims[i], activation=act, name='decoder_%d' % i)(h)

    h = Dense(dims[0], name='decoder_0')(h)

    model = Model(inputs=x, outputs=h)
    model.summary()
    return model

In [None]:
def learn_manifold(x_data, umap_min_dist=0.00, umap_metric='euclidean', 
                   umap_dim=10, umap_neighbors=30):
    md = float(umap_min_dist)
    return umap.UMAP(
        random_state=0,
        metric=umap_metric,
        n_components=umap_dim,
        n_neighbors=umap_neighbors,
        min_dist=md).fit_transform(x_data)

### Split train-test

In [None]:
X_train, X_test = train_test_split(X_vec_norm_df, test_size=0.2, 
                                   random_state=seed)

### Initialize autoencoder

In [None]:
batch_size         = 256
pretrain_epochs    = 64
encoded_dimensions = 10
shape              = [X_vec_norm_df.shape[-1], 250, 250, 500, encoded_dimensions]
autoencoder        = get_autoencoder(shape)
shape

In [None]:
encoded_layer = f'encoder_{(len(shape) - 2)}'

print(f'taking the last encoder layer:{encoded_layer}')

hidden_encoder_layer = autoencoder.get_layer(name=encoded_layer).output
encoder              = Model(inputs=autoencoder.input, 
                             outputs=hidden_encoder_layer)
autoencoder.compile(loss='mse', optimizer='adam')

### Train autoencoder

In [None]:
# %b (abbreviated month name) is the portable equivalent of %h
model_series = 'CLS_MODEL_' + datetime.now().strftime("%b%d%Y-%H%M")

checkpointer = ModelCheckpoint(filepath=f"{model_series}-model.h5", verbose=0, 
                               save_best_only=True)

autoencoder.fit(
    X_train,
    X_train,
    batch_size=batch_size,
    epochs=pretrain_epochs,
    verbose=1,
    validation_data=(X_test, X_test),
    callbacks=[checkpointer]
)

autoencoder = load_model(f"{model_series}-model.h5")

In [None]:
# Save the autoencoder weights; make sure the output directory exists first
Path('weights').mkdir(exist_ok=True)
weights_name = 'weights/' + model_series + "-" + str(pretrain_epochs) + '-ae_weights.h5'
autoencoder.save_weights(weights_name)

In [None]:
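# Use the weights learned by the encoder to encode the data into a
# lower-dimensional representation (embedding).
# NOTE: X is not defined in this notebook; presumably the full feature matrix.
X_encoded = encoder.predict(X)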

In [None]:
X_reduced = learn_manifold(X_encoded, umap_neighbors=30, 
                           umap_dim=int(encoded_dimensions/2))

In [None]:
# this is the data that we need to cluster
labels = hdbscan.HDBSCAN(
    min_samples=100,
    min_cluster_size=1000,
).fit_predict(X_reduced)

In [None]:
unique, counts = np.unique(labels, return_counts=True)
print(np.asarray((unique, counts)).T)

In [None]:
# Note: the clustering was performed on the UMAP result, but the reduction to
# 2 dimensions here (used only for plotting) is performed on the encoder output.
reducer = umap.UMAP(n_components=2)
embedding = reducer.fit_transform(X_encoded)

In [None]:
fig = plt.figure(figsize=(12,8))
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], c=labels, 
            cmap='tab20c')

viz_clusters = pd.DataFrame(embedding)
viz_clusters['cluster'] = labels

for row in viz_clusters.groupby('cluster').mean().reset_index().values:
    label = f'CLUSTER: {row[0]}'
    plt.annotate(label, (row[1], row[2]), textcoords="offset points", 
                 fontsize=12,  xytext=(25,0), ha='center') 