## Step four: Clustering (binning) the latent representation

__The role of clustering in Vamb__

Fundamentally, the process of binning is just clustering sequences based on some of their properties. The purpose of encoding the contigs to a lossy latent representation is to ease the process of clustering because contigs with similar properties are placed close together in latent space, and the latent space is smaller than the input feature space.

With the latent representation conveniently represented by an (n_contigs x n_features) matrix, you could use any clustering algorithm to cluster them (such as the ones in `sklearn.cluster`). In practice though, you have perhaps a million contigs and prior constraints on the diameter, shape and size of the clusters, so non-custom clustering algorithms will probably be slow and inaccurate.

The module `vamb.cluster` implements a simple and fast iterative medoid clustering algorithm. It is well suited for spherical clusters with a maximum size and for many samples. The algorithm is similar to, but subtly different from, the one used in the metagenomic binner Canopy:

    Clustering algorithm:
    (1): Pick random seed observation S
    (2): Define inner_obs(S) = all observations with Pearson distance from S < INNER
    (3): Sample MOVES observations I from inner_obs
    (4): If any inner_obs(i) > inner_obs(S) for i in I: Let S be i, go to (2)
         Else: outer_obs(S) = all observations with Pearson distance from S < OUTER
    (5): Output outer_obs(S) as cluster, remove inner_obs(S) from observations
    (6): If no more observations or MAX_CLUSTERS have been reached: Stop
         Else: Go to (1)
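
The steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code in `vamb.cluster`: the names `normalise_rows`, `distances_from` and `medoid_cluster` and the default `moves=15` are made up for this example, Pearson distance is assumed to be 1 - r, and step (4) is read as comparing the *number* of observations within `INNER` of each candidate.

```python
import numpy as np


def normalise_rows(matrix):
    """Centre each row and scale it to unit length, so a row dot product is Pearson's r."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # constant rows: avoid dividing by zero
    return centered / norms


def distances_from(normed, index):
    """Assumed Pearson distance (1 - r) from row `index` to every row."""
    dists = 1.0 - normed @ normed[index]
    dists[index] = 0.0  # a point is always within INNER of itself (assuming INNER > 0)
    return dists


def medoid_cluster(matrix, inner, outer, moves=15, max_clusters=None, seed=0):
    """Yield (medoid_index, member_indices) pairs following steps (1) to (6)."""
    rng = np.random.default_rng(seed)
    normed = normalise_rows(np.asarray(matrix, dtype=float))
    remaining = np.arange(len(normed))  # row indices not yet removed
    n_clusters = 0
    while len(remaining) > 0 and (max_clusters is None or n_clusters < max_clusters):
        sub = normed[remaining]
        medoid = rng.integers(len(remaining))  # (1) random seed observation
        dists = distances_from(sub, medoid)
        inner_obs = np.flatnonzero(dists < inner)  # (2)
        moved = True
        while moved:
            moved = False
            n_sample = min(moves, len(inner_obs))
            for candidate in rng.choice(inner_obs, size=n_sample, replace=False):  # (3)
                cand_dists = distances_from(sub, candidate)
                cand_inner = np.flatnonzero(cand_dists < inner)
                if len(cand_inner) > len(inner_obs):  # (4) move to the denser point
                    medoid, dists, inner_obs = candidate, cand_dists, cand_inner
                    moved = True
                    break
        yield remaining[medoid], remaining[dists < outer]  # (5) output outer_obs
        remaining = np.delete(remaining, inner_obs)  # (5) remove inner_obs
        n_clusters += 1  # (6) loop condition checks the stopping criteria
```

Because only the inner observations are removed in step (5), an observation lying between `INNER` and `OUTER` of one medoid can still show up in a later cluster.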

__Determining clustering threshold__

You will notice that this algorithm depends on the parameters `INNER` and `OUTER`. These are thresholds on the Pearson distance: contigs that lie within the threshold of a medoid are considered part of the same bin. Getting these values right is crucial for the binning to work well. Set them too low, and the bins will be highly fragmented. Too high, and distinct genomes will be binned together.

To estimate a good threshold for this distance, we have written the module `threshold`, which contains the function `getthreshold`:

---

In [None]:
help(vamb.threshold.getthreshold)

---
This function samples multiple random contigs (1000 by default) and calculates the distance to all other contigs. Sometimes, these distances separate neatly into a small group of close contigs and all the other contigs. For sampled contigs where this is the case, `getthreshold` records the distance which separates those two groups, and then it returns the median of those values.

The world isn't always so neat, however, and sometimes the clusters are not very well separated: there is no distance between the "close" contigs and the "far" ones where the contig density drops to zero, i.e. there are contigs at any given distance from the sampled one. When that's the case, `getthreshold` uses the first distance where you see a dip in the contig density and flags this distance as "not well separated". This is usually a little smaller than the valley of the dip, because the lack of separation means we would get too much noise if we chose the valley as the threshold.

Sometimes it's even worse and there isn't even a dip in density: the other contigs appear to be randomly distributed around the sampled contig. For those contigs, `getthreshold` does not find any threshold.
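
To get a feel for what "contig density as a function of distance" looks like, here is a small sketch on synthetic data. It is not how `getthreshold` works internally: the helper `pearson_distance_to_all`, the matrix `latent_demo` and the choice of 1 - r as Pearson distance are assumptions made for this illustration only. It builds a "neat" case, where 50 rows sit close to one pattern and 450 rows are random, and prints a text histogram of distances from the first row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a latent matrix: 50 rows near one pattern, 450 random rows
pattern = rng.normal(size=32)
close = pattern + rng.normal(scale=0.1, size=(50, 32))
far = rng.normal(size=(450, 32))
latent_demo = np.vstack([close, far])


def pearson_distance_to_all(matrix, index):
    """Assumed Pearson distance (1 - r) from row `index` to every row."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    # Clip away floating point noise so every value lies in [0, 2]
    return np.clip(1.0 - normed @ normed[index], 0.0, 2.0)


dists = pearson_distance_to_all(latent_demo, 0)
counts, edges = np.histogram(dists, bins=20, range=(0.0, 2.0))

# One line per distance bin: left edge, a bar, and the raw count
for count, left in zip(counts, edges[:-1]):
    print(f"{left:4.1f} {'#' * (int(count) // 5)} ({int(count)})")
```

In this well-separated case, the bins just above the first one are empty, and any distance in that gap splits the "close" rows from the rest. In the less neat cases described above, those bins would only dip, or not dip at all.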

The `support` and `separation` values measure the fraction of sampled contigs for which any threshold is found, and the fraction which are "well separated", respectively. If too many (> 25%, we believe) have no threshold, the latent representation failed to properly separate the bins. If too many are not well separated, the clustering will work but perform suboptimally.

---

In [None]:
threshold, support, separation = vamb.threshold.getthreshold(latent, samples=1000)

print('Estimated threshold:', round(threshold, 3))
print('Support:', round(support * 100, 1), '%')
print('Separation:', round(separation * 100, 1), '%')

---
Uh oh, only 5.3% separation. Around 50% support, so it *can* find thresholds, it just relies on finding the "dip" in the density as mentioned above. This means we shouldn't have too high hopes for the quality of the bins, but they shouldn't be totally rubbish. Now, to the clustering itself:

In Vamb, we have implemented the algorithm as two distinct functions: 

* `vamb.cluster.cluster` simply clusters a matrix, and so scales approximately as O(n<sup>2</sup>).

* `vamb.cluster.tandemcluster` does some very rough preclustering and then clusters each precluster using `vamb.cluster.cluster`. Each observation is then assigned uniquely to the largest cluster it's a member of. This scales better with the number of contigs, but is also significantly less accurate.
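
The "assigned uniquely to the largest cluster" step can be sketched as follows. This is only an illustration of the idea, not the code in `vamb.cluster.tandemcluster`: the function name `assign_to_largest` is hypothetical, and breaking ties in favour of the cluster seen first is an assumption.

```python
def assign_to_largest(clusters):
    """Turn overlapping clusters (name -> set of observations) into disjoint ones."""
    best = {}  # observation -> name of the largest cluster containing it so far
    for name, members in clusters.items():
        for obs in members:
            current = best.get(obs)
            # Sizes are compared on the original, overlapping clusters
            if current is None or len(members) > len(clusters[current]):
                best[obs] = name

    disjoint = {}
    for obs, name in best.items():
        disjoint.setdefault(name, set()).add(obs)
    return disjoint


overlapping = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}}
print(assign_to_largest(overlapping))
```

Here observation `3` stays in `a` (size 3 beats size 2), observation `4` goes to `b` (size 2 beats size 1), and `c` ends up empty and so disappears from the result.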

You can use the slow-but-accurate `vamb.cluster.cluster` with up to one or two million contigs depending on your patience, or ~10 million contigs if you're alright with running it for days.

The heavy lifting here is done in Numpy, so it might be worth making sure the BLAS library your Numpy is using is fast. You can check it with `numpy.__config__.show()` and if it says anything other than `NOT AVAILABLE` under the `mkl` or `openblas` entries, you're golden.
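
A quick way to do that check, plus a rough sanity test of matrix multiplication speed, might look like the sketch below. The matrix size and the use of `float32` are arbitrary choices for this example, and the layout of the printed configuration differs between Numpy versions, so the entries may not be named exactly as described above.

```python
import time

import numpy as np

# Print the build configuration, including which BLAS/LAPACK libraries were found
np.show_config()

# Time one matrix product; this operation is handed off to BLAS when one is available
rng = np.random.default_rng(0)
a = rng.random((1000, 1000), dtype=np.float32)

start = time.perf_counter()
product = a @ a.T
elapsed = time.perf_counter() - start

print(f"1000 x 1000 float32 matrix product: {elapsed:.3f} s")
```

If this single product takes a large fraction of a second or more on a modern machine, it may be worth looking into which BLAS your Numpy is linked against.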

---

In [None]:
help(vamb.cluster.cluster)

import numpy as np

# As written in the help above, labels must be a Numpy array!
labels = np.array(contignames)

# Unlike tandemcluster, which outputs the dictionary directly,
# the output of cluster is a generator
cluster_iterator = vamb.cluster.cluster(latent, labels, threshold)

clusters = dict()
for medoid, contigs in cluster_iterator:
    clusters[medoid] = contigs

print('Last key:', medoid, '(of type:', type(medoid), ')')
print('Type of values:', type(contigs))
print('First element of value:', next(iter(contigs)), 'of type:', type(next(iter(contigs))))

## Step five: Postprocessing the clusters

We haven't written any postprocessing modules because how to postprocess really depends on what you're looking for in your data.

One of the greatest weaknesses of Vamb - probably of metagenomic binners in general - is that the bins tend to be highly fragmented. You'll have lots of tiny bins, some of which are legitimate (viruses, plasmids), but most are parts of larger genomes that didn't get binned properly.

We're in the process of developing a tool for annotating, cleaning and merging bins based on phylogenetic analysis of the genes in the bins. That would be extremely helpful, but for now, we'll have to use more crude approaches:

We throw away all bins with fewer than 250,000 base pairs.

---

In [None]:
# First let's make a contignames: length dict
lengthof = dict(zip(contignames, lengths))

# Now filter away the small bins
filtered_bins = dict()

for medoid, contigs in clusters.items():
    binsize = sum(lengthof[contig] for contig in contigs)
    
    if binsize >= 250000:
        filtered_bins[medoid] = contigs

print('Number of bins before filtering:', len(clusters))
print('Number of bins after filtering:', len(filtered_bins))

---
Now, let's save the clusters to disk. For this we will use two writer functions:

1) `vamb.cluster.writeclusters`, that writes which clusters contain which contigs to a simple tab-separated file, and

2) `vamb.vambtools.writebins`, that writes FASTA files corresponding to each of the bins to a directory.

We will need to load all the contigs belonging to any bin into memory to use `vamb.vambtools.writebins`. If your bins don't fit in memory, sorry, you gotta find another way to make those FASTA bins.

Either way, the cluster names that get written will be the dictionary keys of the bins. Right now, our bins have names like `s101_NODE_663_length_11806_cov_4.19433` - not exactly poetic. We'll rename the bins first.

---

In [None]:
# Rename bin keys to something less horrible as a file name
filtered_bins = {'cluster_' + str(i+1): v for i, v in enumerate(filtered_bins.values())}

with open('/home/jakni/Downloads/example/bins.tsv', 'w') as file:
    vamb.cluster.writeclusters(file, filtered_bins)

# Only keep contigs in any filtered bin in memory
allcontigs = set().union(*filtered_bins.values())  # also works if no bins passed the filter

with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as file:
    fastadict = vamb.vambtools.loadfasta(file, keep=allcontigs)
    
vamb.vambtools.writebins('/home/jakni/Downloads/example/bins/', filtered_bins, fastadict)