## Running Iterative Clustering
This Jupyter notebook can be used to run iterative clustering on large data sets.

### Before you get started
#### Hard drive space
This runs best on a machine with a large local SSD. Here is an example of the storage space you'll need:
* 250GB - raw_counts.h5ad
* 250GB - normalized.h5ad (if you already have this file, you don't need the raw_counts.h5ad)
* 250GB - space used by iterative clustering to store temporary files
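
Before starting a long run, it can help to confirm that the target disk has enough room. The following is a minimal sketch using only the Python standard library. The helper names and the 750 GB default are illustrative assumptions (the default is just the sum of the three example figures above), so adjust them to your own data.

```python
import shutil

# Assumption: sum of the three example figures above (3 x 250 GB).
REQUIRED_GB = 750


def free_space_gb(path="."):
    """Return free space, in GB (10**9 bytes), on the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free space here: {free_space_gb('.'):.1f} GB")
    print(f"Enough for the example workload: {has_enough_space('.')}")
```

Pass the directory you plan to use for temporary files as `path`, since that is where most of the space is consumed.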

#### Running Jupyter Notebook through ssh
You can run Jupyter Notebook through SSH:
1) SSH into the remote machine you want to run it on
2) Activate your transcriptomic clustering environment
3) Run `nohup jupyter notebook --no-browser --port=1234 &`. This starts the Jupyter Notebook server and keeps it running if your SSH connection is terminated. Alternatively, you could use a program like `tmux` or `screen`. Copy the address it shows for step 5
4) Back on your machine, run `ssh -NL 1234:localhost:1234 username@remote-machine` to open an SSH tunnel
5) Enter the address from step 3 in your browser (it should look something like 'http://localhost:1234/?token=8d186032bbbe095b294789e863b065a546fcc15b68683c99'). Now you should be able to interact with the notebook on the remote machine!
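
If the browser cannot reach the notebook, a quick check is whether anything is listening on the local end of the tunnel. This is a small sketch using the standard `socket` module. The function name `port_is_open` is hypothetical, and port 1234 is just the port used in the steps above.

```python
import socket


def port_is_open(host="localhost", port=1234, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `port_is_open()` on your local machine after opening the tunnel should return `True`. A `False` result suggests the tunnel is not open or a different port is in use.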

#### Temporary Directories
Because AnnData doesn't support multiple views of filebacked data (e.g. subset=adata[1:5], subset[1:2]), we have to create temporary files for each cluster until a whole cluster fits in memory. We store these files in the temporary directory. It mostly cleans up after itself, but always check for and remove old tmp files to keep your hard drive free.
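
One way to spot leftovers is to list entries in the system temporary directory that have not been modified for a while. The sketch below is an assumption-laden helper, not part of `transcriptomic_clustering`: `stale_entries` is a hypothetical name, and it only matches names starting with Python's default temp prefix, which is what `tempfile.mkdtemp()` uses when called without arguments. It only lists candidates and deletes nothing.

```python
import os
import tempfile
import time


def stale_entries(root=None, max_age_hours=24.0, prefix=None):
    """List paths directly under `root` whose name starts with `prefix`
    and whose modification time is older than `max_age_hours`."""
    root = root or tempfile.gettempdir()
    prefix = tempfile.gettempprefix() if prefix is None else prefix
    cutoff = time.time() - max_age_hours * 3600
    stale = []
    for entry in os.scandir(root):
        try:
            if entry.name.startswith(prefix) and entry.stat().st_mtime < cutoff:
                stale.append(entry.path)
        except OSError:
            # The entry disappeared while we were scanning; skip it.
            continue
    return sorted(stale)
```

Review the returned paths by hand before removing anything, since other programs also write to the system temporary directory.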

#### Normalize the data first
Normalize the data first and save the result so you don't need to rerun it if you restart iterative clustering. Normalizing takes 60-90 minutes, but if the normalized file already exists you can skip this step and start directly at iterative clustering.
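
The "reuse it if it exists, otherwise create it" pattern can be sketched as a small generic helper. `load_or_create` is a hypothetical name. The commented example shows how it might be wired up with the `sc.read_h5ad` and `tc.normalize` calls used later in this notebook, assuming the normalized file is kept somewhere that is not deleted with the temporary directory.

```python
import os


def load_or_create(path, load, create):
    """Return load(path) if `path` already exists, otherwise create(path).

    `load` and `create` are callables that take the file path; `create`
    is expected to write the file so later runs can take the fast path.
    """
    if os.path.exists(path):
        return load(path)
    return create(path)


# Possible use in this notebook (not executed here; the path is an assumption):
#   normalized_adata = load_or_create(
#       './data/normalized.h5ad',
#       load=lambda p: sc.read_h5ad(p, backed='r'),
#       create=lambda p: tc.normalize(adata, copy_to=p),
#   )
```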

#### If SSH disconnects
Just reopen the SSH tunnel (steps 4 and 5), and you'll be able to see the notebook. However, due to a known unresolved issue, when you reopen the notebook you will no longer get log messages from iter_clust. As a workaround, you can monitor progress with commands like `top`, `ls` the temporary directory to see new files being created, or try [this workaround](https://github.com/jupyter/jupyter/issues/83#issuecomment-622984009).
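
Watching the temporary directory can also be done from Python, for example from a second terminal on the remote machine. The helper below is a hypothetical sketch that lists the most recently modified files under a directory. It assumes nothing about how iter_clust names its files.

```python
import os


def newest_files(directory, n=5):
    """Return up to `n` (path, mtime) pairs for the most recently
    modified files anywhere under `directory`, newest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                found.append((path, os.path.getmtime(path)))
            except OSError:
                # The file was removed between listing and stat; skip it.
                continue
    found.sort(key=lambda item: item[1], reverse=True)
    return found[:n]
```

If the newest modification times keep advancing between calls, clustering is most likely still making progress.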

In [None]:
%config Application.log_level='INFO'
import logging
logging.getLogger().setLevel(logging.INFO)

In [None]:
import tempfile
import os
import shutil
import json

import matplotlib.pyplot as plt

import numpy as np
import scipy as scp
import scipy.cluster.hierarchy  # make sure the submodule used for the dendrogram is loaded
import scanpy as sc
import transcriptomic_clustering as tc
from transcriptomic_clustering.iterative_clustering import (build_cluster_dict, iter_clust, OnestepKwargs)

In [None]:
# Setup input/output files
output_file = os.path.expanduser('clusters.json')

path_to_adata = './data/tasic2016counts_sparse.h5ad'
adata = sc.read_h5ad(path_to_adata, backed='r')

In [None]:
# Set memory params
tc.memory.set_memory_limit(GB=1)
tc.memory.allow_chunking = True

In [None]:
# Assign kwargs. Any unassigned args will be set to their respective function defaults
merge_clusters_kwargs = {
    'thresholds': {
        'q1_thresh': 0.5,
        'q2_thresh': None,
        'cluster_size_thresh': 15,
        'qdiff_thresh': 0.7,
        'padj_thresh': 0.05,
        'lfc_thresh': 1.0,
        'score_thresh': 200,
        'low_thresh': 1
    },
    'de_method': 'ebayes'
}
onestep_kwargs = OnestepKwargs(merge_clusters_kwargs=merge_clusters_kwargs)

In [None]:
# Remove old tmp_dir and make new one
try:
    shutil.rmtree(tmp_dir)
except NameError:
    pass # tmp_dir didn't exist
tmp_dir = tempfile.mkdtemp()

In [None]:
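# Normalize adata. The result is written inside tmp_dir, which is deleted when this
# notebook finishes (and whenever the cell above is re-run), so copy it elsewhere to reuse it.
norm_adata_path = os.path.join(tmp_dir, 'normalized.h5ad')
normalized_adata = tc.normalize(adata, copy_to=norm_adata_path)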

In [None]:
# Run clustering
clusters = iter_clust(
    normalized_adata,
    min_samples=4,
    onestep_kwargs=onestep_kwargs,
    random_seed=123,
    tmp_dir=tmp_dir
)
cluster_dict = build_cluster_dict(clusters)

In [None]:
clusters

In [None]:
cluster_by_obs = np.zeros(normalized_adata.n_obs, dtype=int)
for cluster, obs in cluster_dict.items():
    cluster_by_obs[obs] = cluster
cluster_means, _, _ = tc.get_cluster_means(normalized_adata, cluster_dict, cluster_by_obs)
linkage, labels = tc.hclust(cluster_means)

In [None]:
%matplotlib inline

plt.figure()
dn = scp.cluster.hierarchy.dendrogram(linkage, labels=labels)
plt.show()

In [None]:
with open(output_file, 'w') as f:
    json.dump(cluster_dict, f)

In [None]:
shutil.rmtree(tmp_dir)