# TO DO:

Rewrite this notebook:

1) Installation - elide the importing to Python part

2) How to use from command line

3) How to use from Python w. examples

# Installation

Since Python is interpreted and cross-platform you just have to get the files on your computer:

__If you have `git` installed__:

    [jakni@nissen:scripts]$ git clone https://github.com/jakobnissen/vamb vamb
    
__If you don't__

You then presumably have access to the vamb directory with this notebook, so just put it wherever:

    [jakni@nissen:scripts]$ cp -r /path/to/vamb/directory vamb

# Quickstart

Take a brief look at the options with:

    [jakni@nissen:scripts]$ python path/to/vamb/runvamb.py --help

Do the defaults look alright? They probably do, but you might want to check the number of processes to launch, GPU acceleration, and whether you want the faster `tandemclustering` option enabled.

Then just do:

    [jakni@nissen:scripts]$ python path/to/vamb/runvamb.py outdir contig.fna path/to/bamfiles/*.bam

# Prerequisites

Like other metagenomic binners, Vamb relies on two properties of the DNA sequences to be binned:

* The kmer-composition of the sequence (here tetranucleotide frequency, *TNF*), and
* The abundance of the contigs in each sample (the *depth* or the *RPKM*).

So before you can run Vamb, you need to have files from which Vamb can calculate these values.

* TNF is calculated from a regular fasta file of DNA sequences.
* Depth is calculated from BAM-files of mapping reads to that same fasta file.

The observed values for both of these measures become uncertain when the sequences are too short, because each value is then estimated from only a few observations (the law of large numbers). Therefore, Vamb works poorly on short sequences.

With fewer samples (up to 100), we recommend using contigs from an assembly with a minimum contig length cutoff of ~2000 basepairs. With many samples, the number of contigs becomes overwhelming. The better approach is then to split the dataset up into smaller chunks and bin them independently.

If needed, Vamb *can* also work on shorter sequences such as genes, which are more easily homology reduced and thus can support hundreds of samples.

There are situations where you can't just filter the fasta file, maybe because you have already spent tonnes of time getting those BAM files and you're not going to remap if your life depended on it, or because your fasta file contains genes and so removing all entries less than e.g. 2000 bps is a bit too much to ask.

In those situations, you can still pass the argument `minlength` if you want to have Vamb ignore the smaller contigs. This is not ideal, since the smaller contigs will still have recruited some reads during mapping, which are then not mapped to the larger contigs, but it can work alright.


### Recommended preparation

__1) Preprocess the reads and check their quality__

We recommend AdapterRemoval combined with FastQC for this.

__2) Assemble each sample individually OR co-assemble and get the contigs out__

We recommend using metaSPAdes on each sample individually.

__3) Concatenate the FASTA files together while making sure all contig headers stay unique__

We recommend prepending the sample name to each contig header from that sample.

__4) Remove all small contigs from the FASTA file__

There's a tradeoff here between a too low cutoff, retaining hard-to-bin contigs which adversely affects the binning of all contigs, and throwing out good data. We recommend choosing a length cutoff of ~2000 bp.

__5) Map the reads to the FASTA file to obtain 6 .bam files__

We have used BWA MEM for mapping, fully aware that it is not suited for this task. In theory, any mapper that produces a BAM file with an alignment score tagged 'AS:i' and multiple secondary hits tagged 'XA:Z' can work.
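
Steps 3 and 4 above can be done in a single pass over the per-sample assemblies. The following is a minimal sketch, not part of Vamb: the function names `read_fasta` and `concatenate_samples` are hypothetical, and it assumes each sample's FASTA fits in memory as text lines. It prefixes each header with the sample name and an underscore (the `s30_NODE_1_...` style used later in this tutorial) and drops sequences below the cutoff.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def concatenate_samples(samples, minlength=2000):
    """samples maps sample name -> iterable of FASTA lines.
    Returns one FASTA string with unique, sample-prefixed headers."""
    seen = set()
    records = []
    for sample, lines in samples.items():
        for header, sequence in read_fasta(lines):
            if len(sequence) < minlength:
                continue  # step 4: remove small contigs
            newheader = sample + '_' + header  # step 3: keep headers unique
            if newheader in seen:
                raise ValueError('Duplicate header: ' + newheader)
            seen.add(newheader)
            records.append('>' + newheader + '\n' + sequence + '\n')
    return ''.join(records)


# Toy input with a tiny cutoff so the example stays readable
toy = {
    's30': ['>NODE_1', 'ACGTACGT', 'ACGT', '>NODE_2', 'ACG'],
    's101': ['>NODE_1', 'TTTTGGGGCC'],
}
print(concatenate_samples(toy, minlength=5), end='')
```

With real data you would pass open file handles instead of lists and write the result to the FASTA file you then map against.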

___

In this tutorial, we have the two relevant prerequisite files in the directory `/home/jakni/Downloads/example`:

* `contigs.fna` - The filtered FASTA contigs which were mapped against, and
* `bamfiles/*.bam` - The 6 .bam files from mapping the reads to the contigs above.

# Running from command line

You can run either the entire pipeline from the command line, or each module independently.

---

__For the entire pipeline__, you need to use the `runvamb.py` script:

    [jakni@nissen:~]$ python Documents/scripts/vamb/runvamb.py --help
    usage: python runvamb.py OUTPATH FASTA BAMPATHS [OPTIONS ...]

    Run the Vamb pipeline.

    Creates a new direcotry and runs each module of the Vamb pipeline in the
    new directory. Does not yet support resuming stopped runs - in order to do so,
    
    [ lines elided ]

You use it like this:

    [jakni@nissen:~]$ python path/to/vamb/runvamb.py output_directory contig.fna path/to/bamfiles/*.bam
    
__For each module__, you find the relevant script:

    [jakni@nissen:~]$ python Documents/scripts/vamb/parsecontigs.py --help
    usage: parsecontigs.py contigs.fna(.gz) tnfout lengthsout

    Calculate z-normalized tetranucleotide frequency from a FASTA file.
    
    [ lines elided ]

# Detailed walkthrough and running from the Python interpreter

The Vamb pipeline consists of a handful of tasks, each of which has a dedicated module:

---
1) Parse fasta file and get TNF of each sequence, as well as sequence length and names

2) Parse the BAM files and get depth estimate for each sequence in the fasta file

3) Train a VAE with the depths and TNF matrices

4) Encode the depths and TNF matrices using the VAE

5) Cluster the encoded inputs to metagenomic bins

---
In this walkthrough, we will go through each step in more detail from within the Python interpreter. We will explain what each step does. With this knowledge, you should be able to extend Vamb relatively easily.

---

## Step zero: Importing Vamb and getting help

The first step is to get Vamb imported:
    
    [jakni@nissen:~]$ python
    >>> import vamb
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'vamb'
    >>> # We're not in the directory containing the vamb directory.
    >>> # That means the directory containing the vamb dir is not in our sys.path.
    >>> # Either move the vamb directory to one of your sys.path dirs 
    >>> # or add the vamb directory to sys.path. We'll do the latter.
    >>> import sys
    >>> sys.path.append('/home/jakni/Documents/scripts/')
    >>> import vamb
    >>> # No error message - success!

You'll almost certainly need help when using Vamb (we wish it was so easy you didn't, but making user-friendly software is *hard!*).

Luckily, there's the built-in `help` function in Python.

---

`>>> help(vamb)`
    
    Help on package vamb:

    NAME
        vamb - Variational Autoencoder for Metagenomic Binning

    DESCRIPTION
        Vamb does what it says on the tin - bins metagenomes using a variational autoencoder.
        
    [ lines elided ]
    
        General workflow:
        1) Filter contigs by size using vamb.filtercontigs
        2) Map reads to contigs to obtain BAM file
        3) Calculate TNF of contigs using vamb.parsecontigs
        4) Create RPKM table using vamb.parsebam
        5) Train autoencoder using vamb.encode
        6) Cluster latent representation using vamb.cluster
    
    [ lines elided ]
    
---
    
The `PACKAGE CONTENTS` under `help(vamb)` is just a list of all importable files in the `vamb` directory - some of these really shouldn't be imported, so ignore that.

---
You can also get help for the modules:

`>>> help(vamb.cluster)`

    Help on module vamb.cluster in vamb:

    NAME
        vamb.cluster - Iterative medoid clustering of Numpy arrays.

    DESCRIPTION
        Implements two core functions: cluster and tandemcluster, along with the helper
        functions writeclusters and readclusters.
        For all functions in this module, a collection of clusters are represented as
        a {clustername, set(elements)} dict.

        Clustering algorithm:
    
    [ lines elided ]
        
---
And for functions:

`>>> help(vamb.cluster.tandemcluster)`

    Help on function tandemcluster in module vamb.cluster:

    tandemcluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
        Splits the datasets, then clusters each partition before merging
        the resulting clusters. This is faster, especially on larger datasets, but
        less accurate than normal clustering.

        Inputs:
            matrix: A (obs x features) Numpy matrix of values
            labels: Numpy array with labels for matrix rows. None or 1-D array
            inner: Optimal medoid search within this distance from medoid
            outer: Radius of clusters extracted from medoid. If None, same as inner
            max_steps: Stop searching for optimal medoid after N futile attempts
            spearman: Use Spearman, not Pearson correlation

        Output: {medoid: set(labels_in_cluster) dictionary}

In [1]:
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

## Step one: Parse the FASTA file

If you forget what to do at each step, remember that `help(vamb)` said:

    General workflow:
    1) Filter contigs by size using vamb.filtercontigs
    2) Map reads to contigs to obtain BAM file
    3) Calculate TNF of contigs using vamb.parsecontigs
    
    [ lines elided ]

Okay, we already have filtered contigs, and we have mapped reads to them and gotten BAM files, so we begin with `vamb.parsecontigs`. How do you use that?

In [2]:
help(vamb.parsecontigs)

Help on module vamb.parsecontigs in vamb:

NAME
    vamb.parsecontigs - Calculate z-normalized tetranucleotide frequency from a FASTA file.

DESCRIPTION
    Usage:
    >>> with open('/path/to/contigs.fna', 'rb') as filehandle
    ...     tnfs, contignames, lengths = read_contigs(filehandle)

FUNCTIONS
    read_contigs(byte_iterator, minlength=100)
        Parses a FASTA file open in binary reading mode.
        
        Input:
            byte_iterator: Iterator of binary lines of a FASTA file
            minlength[100]: Ignore any references shorter than N bases 
        
        Outputs:
            tnfs: A (n_FASTA_entries x 136) matrix of tetranucleotide freq.
            contignames: A list of contig headers
            lengths: A list of contig lengths

DATA
    TNF_HEADER = '#contigheader\tAAAA/TTTT\tAAAC/GTTT\tAAAG/CTTT\tAAAT...A...
    __cmd_doc__ = 'Calculate z-normalized tetranucleotide frequency...eoti...

FILE
    /home/jakni/Documents/scripts/vamb/parsecontigs.py




---
I use `vamb.parsecontigs.read_contigs` with the inputs and outputs as written:

---

In [4]:
# Open the file in binary mode - you can use the vamb.vambtools.Reader to read
# from normal or gzipped file seamlessly. Here I just use the open function
with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as filehandle:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(filehandle)

In [5]:
# Let's have a look at the resulting data

print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 10 elements of contignames:')
for i in range(10):
    print(contignames[i])
    
print('\n')
    
print('Type of lengths:', type(lengths))
print('Length of lengths:', len(lengths), end='\n\n')

print('First 10 elements of lengths:')
for i in range(10):
    print(lengths[i])

Type of tnfs: <class 'numpy.ndarray'> of dtype float32
Shape of tnfs: (39551, 136)

Type of contignames: <class 'list'>
Length of contignames: 39551

First 10 elements of contignames:
s30_NODE_1_length_245508_cov_18.4904
s30_NODE_2_length_222690_cov_39.7685
s30_NODE_3_length_222459_cov_20.3665
s30_NODE_4_length_173155_cov_20.1181
s30_NODE_5_length_161239_cov_20.1237
s30_NODE_6_length_157102_cov_20.734
s30_NODE_7_length_156768_cov_44.8078
s30_NODE_8_length_152691_cov_19.6759
s30_NODE_9_length_121154_cov_21.6491
s30_NODE_10_length_119726_cov_136.834


Type of lengths: <class 'numpy.ndarray'>
Length of lengths: 39551

First 10 elements of lengths:
245508
222690
222459
173155
161239
157102
156768
152691
121154
119726


---
It turns out that related organisms tend to share a similar kmer-distribution across most of their genome. The reason for that is not understood, even though it's believed that common functional motifs, GC-content and presence/absence of endonucleases explain some of the observed similarity.

The `tnfs` is the tetranucleotide frequency - it's the frequency of the canonical kmer of each 4mer in the contig. The matrix is z-score normalized across contigs for each sample such that the frequency of e.g. 'AGGC' is measured relative to other contigs in that sample - this increases the signal-to-noise ratio.

We use 4-mers because there are 136 canonical 4-mers, which is an appropriate number of features to cluster - not so few that there's no signal and not so many that it becomes unwieldy and the estimates of the frequencies become uncertain.
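
To see where the number 136 comes from, the sketch below enumerates all 256 possible 4-mers and collapses each one with its reverse complement. It is an illustration only, not Vamb's code: the helper name `canonical` is hypothetical, and picking the lexicographically smaller of the two strands as the representative is an assumption (it happens to agree with labels such as `AAAA/TTTT` and `AAAC/GTTT` in the `TNF_HEADER` shown above).

```python
from itertools import product

COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


all_4mers = [''.join(p) for p in product('ACGT', repeat=4)]
canonical_4mers = sorted({canonical(kmer) for kmer in all_4mers})

# 16 of the 256 4-mers are their own reverse complement (e.g. ACGT);
# the other 240 pair up, giving 240 / 2 + 16 = 136 canonical 4-mers.
print(len(all_4mers), len(canonical_4mers))  # 256 136
```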

At this point, you should probably consider whether you can keep everything in memory. If not, all the relevant modules have reading and writing functions so you can dump the results to disk and delete them from memory. This is a small dataset, so there's no problem. With hundreds of samples and millions of contigs however, this becomes a problem, even though Vamb is fairly memory-friendly.

As a rule of thumb, the memory consumption for the most memory intensive step is approximately 8 × (n_samples + 136) × n_contigs bytes plus a little bit of overhead. If this is much lower than your RAM, don't worry about it. If it's within a factor of 5 of your available RAM, you'll need to delete objects you don't need anymore.

In my example, I have 6 samples and 39551 contigs for a total memory usage of ~45 MB.
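
The rule of thumb is easy to evaluate for your own dataset before you start. The helper below is a hypothetical convenience function, not part of Vamb; it only encodes the formula stated above and ignores the "little bit of overhead".

```python
def peak_memory_bytes(n_samples, n_contigs, n_tnf=136, bytes_per_value=8):
    """Approximate memory use of the most memory-intensive step, in bytes."""
    return bytes_per_value * (n_samples + n_tnf) * n_contigs


# The example dataset: 6 samples and 39551 contigs
print(round(peak_memory_bytes(6, 39551) / 1e6, 1), 'MB')  # 44.9 MB

# A much larger, hypothetical dataset: 500 samples and 5 million contigs
print(round(peak_memory_bytes(500, 5_000_000) / 1e9, 1), 'GB')  # 25.4 GB
```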

---

## Step two: Parsing the BAM files

In [3]:
# Again, we can use the help function to see what we need to do
help(vamb.parsebam.read_bamfiles)

Help on function read_bamfiles in module vamb.parsebam:

read_bamfiles(paths, minscore=50, minlength=100, processors=4)
    Spawns processes to parse BAM files and get contig rpkms.
    
    Input:
        path: Path to BAM file
        minscore [50]: Minimum alignment score (AS field) to consider
        minlength [100]: Ignore any references shorter than N bases 
        processors [all]: Number of processes to spawn
    
    Outputs:
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        contignames: A list of contignames from first BAM header



---
We can see (in the default value for the `processors` argument) that the function detects 4 cores on my laptop. It will then spawn 4 parallel processes to read the BAM files. The number is capped at 8 processes, because at that level, reading almost certainly becomes I/O bound.

As with the `vamb.parsecontigs.read_contigs` function, I don't care about the `minlength` argument, since our fasta file is already filtered.

Lastly, the function ignores all alignments with alignment score less than 50 (as determined by the optional `AS:i` field in the BAM file). That seems reasonable here.

---

In [18]:
bamfiles = !ls /home/jakni/Downloads/example/bamfiles
bamfiles = ['/home/jakni/Downloads/example/bamfiles/' + p for p in bamfiles]
bamfiles

['/home/jakni/Downloads/example/bamfiles/e101.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e178.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e179.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e196.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e198.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e30.filtered.bam']

In [20]:
# That looks right.

# We already have the contignames, so who cares about saving that
sample_rpkms, _ = vamb.parsebam.read_bamfiles(bamfiles)
del _
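
The `!ls` syntax used above to list the BAM files only works inside IPython or Jupyter. In a plain Python script, the standard library can build the same list. This is a minimal sketch; the function name `find_bamfiles` is hypothetical. Note that the order of the returned paths determines the order of the samples later on, so it is worth sorting explicitly.

```python
import glob
import os


def find_bamfiles(directory):
    """Return the sorted paths of all files ending in .bam in `directory`."""
    return sorted(glob.glob(os.path.join(directory, '*.bam')))
```

In this tutorial, the call would be `find_bamfiles('/home/jakni/Downloads/example/bamfiles')`.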

In [9]:
print('Type of sample_rpkms:', type(sample_rpkms))

print('Content of the dict:')
print('First key:', next(iter(sample_rpkms.keys())))
print('First value:', next(iter(sample_rpkms.values())))

Type of sample_rpkms: <class 'dict'>
Content of the dict:
First key: /home/jakni/Downloads/example/bamfiles/e101.filtered.bam
First value: [0.11535063 0.17579393 0.58409214 ... 0.         0.         0.        ]


---
The idea here is that two contigs from the same genome will always be physically present together, and so they should have a similar abundance across all samples. Some contigs represent repeats like duplicated segments - these contigs should have a fixed ratio of abundance to other contigs. Thus, even when considering repeated contigs, there should be a tight Pearson correlation between abundances of contigs from the same genome.
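
The claim about repeats can be checked numerically: scaling an abundance profile by a fixed ratio leaves the Pearson correlation at 1. The abundance values below are made up for illustration, and `pearson` is a hypothetical helper around `numpy.corrcoef`.

```python
import numpy as np

# Made-up abundances of one contig across six samples
single_copy = np.array([0.5, 2.0, 1.5, 4.0, 0.1, 3.0])
# A repeat present in three copies in the same genome: always 3x as abundant
three_copies = 3 * single_copy
# A contig from a different, made-up genome
other_genome = np.array([3.0, 0.2, 2.5, 0.1, 4.0, 0.3])


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


print(pearson(single_copy, three_copies))  # 1.0, up to rounding error
print(pearson(single_copy, other_genome))  # negative for these made-up values
```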

The `vamb.parsebam` module takes a rather crude approach to estimating abundance, namely by simply counting the number of reads mapped to each contig, divided by the total number of reads and the contig's length. In transcriptomics, this measure is often called RPKM, *reads per kilobase per million mapped reads*. Other metagenomic binners like Metabat and Canopy use an average of per-nucleotide depth of coverage instead. We do not believe there is any theoretical or practical advantage of using depth over RPKM. We will use the terms *depth* and *rpkm* interchangeably.
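
As an illustration of the measure itself, here is a minimal sketch of the RPKM arithmetic for one sample. It is not Vamb's implementation: `vamb.parsebam` reads the counts from BAM files and applies the alignment score filter described above, whereas this hypothetical `rpkm` function starts from read counts that are already known.

```python
import numpy as np


def rpkm(read_counts, contig_lengths, total_reads=None):
    """Reads per kilobase of contig per million mapped reads, for one sample."""
    counts = np.asarray(read_counts, dtype=np.float64)
    lengths = np.asarray(contig_lengths, dtype=np.float64)
    if total_reads is None:
        # Assumption: use the reads counted here as the total
        total_reads = counts.sum()
    kilobases = lengths / 1000
    millions = total_reads / 1_000_000
    return (counts / kilobases / millions).astype(np.float32)


# 10 reads on a 2000 bp contig, out of one million mapped reads
print(rpkm([10, 0], [2000, 5000], total_reads=1_000_000))  # [5. 0.]
```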

The object we just created, `sample_rpkms` is the depth table in the form of a dictionary with one column (representing one BAM file) per entry. Just like the TNF, this needs to be converted to a (n_contigs x n_features) Numpy array in order for it to be used in the variational autoencoder.

We can use the function `vamb.parsebam.array_from_sample_rpkms` to do this. This will create a new array in memory while retaining `sample_rpkms`. Each object requires approximately 4 x n_contigs x n_samples bytes. If having two of these objects in memory will consume all your RAM, you can dump the `sample_rpkms` to disk, delete it, and reload it in a matrix using the functions `write_rpkms` and `array_fromnpz` in the `vamb.parsebam` module.
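
Conceptually, the conversion stacks one column per BAM file, in the order you give the paths. The sketch below shows the idea with plain Numpy; it is an assumption about the resulting layout, not the code of `vamb.parsebam.array_from_sample_rpkms`, and the name `stack_samples` is hypothetical. It also confirms the 4 x n_contigs x n_samples size estimate for 32-bit floats.

```python
import numpy as np


def stack_samples(sample_rpkms, paths):
    """Build an (n_contigs x n_samples) float32 array from a {path: 1-D array} dict."""
    columns = [sample_rpkms[path] for path in paths]
    return np.column_stack(columns).astype(np.float32, copy=False)


toy = {
    'a.bam': np.array([0.1, 0.2, 0.3], dtype=np.float32),
    'b.bam': np.array([1.0, 2.0, 3.0], dtype=np.float32),
}
matrix = stack_samples(toy, ['a.bam', 'b.bam'])
print(matrix.shape, matrix.nbytes)  # (3, 2) 24
```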

---

In [24]:
rpkms = vamb.parsebam.array_from_sample_rpkms(sample_rpkms, bamfiles)

---
Now, I tend to be a bit ~~paranoid~~<sup>careful</sup>, so if I loaded in 500 GB of BAM files, I'd want to save the work I have now in case something goes wrong - and we're about to fire up the VAE so lots of things can go wrong.

What important objects do I have in memory right now?

* contignames: A list of contignames
* lengths: A list of contig lengths
* rpkms: A numpy array of rpkms
* tnfs: A numpy array of tnfs

I'm going to use `pickle` to save the Python list and `vamb.vambtools.write_npz` to save the Numpy arrays (the latter is just a wrapper for `numpy.savez_compressed`). Of course, I could have used pickle for it all.

---

In [25]:
import pickle

with open('/home/jakni/Downloads/example/contignames.pickle', 'wb') as file:
    pickle.dump(contignames, file, protocol=4)

with open('/home/jakni/Downloads/example/lengths.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, lengths)

with open('/home/jakni/Downloads/example/tnfs.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, tnfs)
    
with open('/home/jakni/Downloads/example/rpkms.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, rpkms)
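
If something does go wrong, the saved objects can be read back with `pickle.load` and `numpy.load`. The sketch below does a full save-and-reload round trip on toy data with plain Numpy. It assumes the array was passed positionally to `numpy.savez_compressed`, in which case Numpy stores it under the default key `arr_0`; if a file written by `vamb.vambtools.write_npz` uses another key, inspect the `files` attribute of the loaded object. The function name `roundtrip` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def roundtrip(names, array, directory):
    """Save a list with pickle and an array as .npz, then load both again."""
    names_path = os.path.join(directory, 'contignames.pickle')
    array_path = os.path.join(directory, 'array.npz')

    with open(names_path, 'wb') as file:
        pickle.dump(names, file, protocol=4)
    with open(array_path, 'wb') as file:
        np.savez_compressed(file, array)

    with open(names_path, 'rb') as file:
        loaded_names = pickle.load(file)
    with np.load(array_path) as npz:
        loaded_array = npz['arr_0']  # default key for a positional array
    return loaded_names, loaded_array


with tempfile.TemporaryDirectory() as tmp:
    names, array = roundtrip(['contig1', 'contig2'], np.arange(6.0).reshape(2, 3), tmp)
    print(names, array.shape)
```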

## Step three: Train the autoencoder and encode input data

Again, you can use `help` to see how to use the module:

`help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call the `help` functions on those, but I'm not showing that here.

In [5]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, verbose=True)

Epoch: 1	Loss: 2.4766	BCE: 2.4632	MSE: 0.00394	KLD: 0.0094
Epoch: 2	Loss: 2.2839	BCE: 2.2691	MSE: 0.00599	KLD: 0.0088
Epoch: 3	Loss: 2.1273	BCE: 2.1102	MSE: 0.00837	KLD: 0.0087
Epoch: 4	Loss: 1.9752	BCE: 1.9554	MSE: 0.01141	KLD: 0.0083
Epoch: 5	Loss: 1.8483	BCE: 1.8260	MSE: 0.01442	KLD: 0.0079


---
The VAE encodes the high-dimensional (n_samples + 136 features) input data in a lower dimensional space (nlatent features). When training, it learns both the encoding scheme and attempts to reconstruct the input data given the latent representation influenced by gaussian noise.

The theory here is that the latent representation should be a more efficient encoding of the input data. If the input data for the contigs indeed do fall into bins, an efficient encoding should be to simply encode the bin they belong to, then use the "bin identity" to reconstruct the data. We add noise to prevent it from learning a huge number of slightly different bins, where in the most extreme case each bin contains only one contig.

The loss of the VAE is the sum of three measures:

* Binary cross entropy (BCE) measures the dissimilarity of the reconstructed abundances to observed abundances
* Mean squared error (MSE) measures the dissimilarity of reconstructed versus observed TNF
* Kullback-Leibler divergence (KLD) measures the dissimilarity between the standard normal distribution and the distribution of values sampled from the latent layer with the gaussian noise

At least in principle, the latter term induces the VAE to not crazily overfit by imposing some sensible prior on the kind of encodings it can choose.

We can see the Mean Squared Error (which is the TNF-related loss) is rising these first 5 epochs, presumably as the VAE sacrifices an efficient representation of the TNF in order to learn the depths (whose loss is BCE) better. This happens sometimes, and it's alright - after all, co-abundance usually contains more information than TNF, and so we have chosen the BCE to be several orders of magnitude over MSE in order for the VAE to be able to make this choice.
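
Both observations can be verified directly from the training log printed above: the three terms add up to the reported loss (up to the rounding of the printed values), and the MSE rises from epoch to epoch while the BCE falls. The snippet below only parses that printed text; the helper `parse_log` is hypothetical and not part of Vamb.

```python
import re

log = """
Epoch: 1 Loss: 2.4766 BCE: 2.4632 MSE: 0.00394 KLD: 0.0094
Epoch: 2 Loss: 2.2839 BCE: 2.2691 MSE: 0.00599 KLD: 0.0088
Epoch: 3 Loss: 2.1273 BCE: 2.1102 MSE: 0.00837 KLD: 0.0087
Epoch: 4 Loss: 1.9752 BCE: 1.9554 MSE: 0.01141 KLD: 0.0083
Epoch: 5 Loss: 1.8483 BCE: 1.8260 MSE: 0.01442 KLD: 0.0079
"""


def parse_log(text):
    """Turn each 'Key: value' log line into a {key: float} dict."""
    rows = []
    for line in text.strip().splitlines():
        pairs = re.findall(r'(\w+):\s*([\d.]+)', line)
        rows.append({key: float(value) for key, value in pairs})
    return rows


rows = parse_log(log)
for row in rows:
    total = row['BCE'] + row['MSE'] + row['KLD']
    print(int(row['Epoch']), 'sum of terms:', round(total, 4), 'reported:', row['Loss'])

mse = [row['MSE'] for row in rows]
bce = [row['BCE'] for row in rows]
print('MSE rising:', mse == sorted(mse))
print('BCE falling:', bce == sorted(bce, reverse=True))
```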

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [6]:
latent = vae.encode(dataloader)

print(latent.shape)

(39551, 40)


---
That's 39551 contigs each represented by the (non-noisy) value of 40 latent neurons.

Now we need to cluster this. But first, we must determine a proper clustering threshold.

---

## Step four: Determining the clustering threshold

__To be added when we've got a stable API for determining this__

In [7]:
threshold = 0.03

## Step five: Clustering (binning) the latent representation

Fundamentally, the process of binning is just clustering sequences based on some of their properties. The purpose of encoding the contigs to a lossy latent representation is to ease the process of clustering because contigs with similar properties are placed close together in latent space, and the latent space is smaller than the input feature space.

With the latent representation conveniently represented by an (n_contigs x n_features) matrix, you could use any clustering algorithm to cluster them (such as the ones in `sklearn.cluster`). In practice though, you likely have a few million contigs and prior constraints on the diameter, shape and size of the clusters.

The module `vamb.cluster` implements a simple and fast iterative medoid clustering algorithm. It is well suited for spherical clusters with a maximum size and for many samples. It is similar to the clustering algorithm used in the metagenomic binner Canopy.

    Clustering algorithm:
    (1): Pick random seed observation S
    (2): Define inner_obs(S) = all observations with Pearson distance from S < INNER
    (3): Sample MOVES observations I from inner_obs
    (4): If any inner_obs(i) > inner_obs(S) for i in I: Let S be i, go to (2)
         Else: Outer_obs(S) = all observations with Pearson distance from S < OUTER
    (5): Output outer_obs(S) as cluster, remove inner_obs(S) from observations
    (6): If no more observations or MAX_CLUSTERS have been reached: Stop
         Else: Go to (1)
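
To make the steps concrete, here is a simplified Numpy sketch of the listing above. It is not the code in `vamb.cluster`: the name `sketch_cluster` is hypothetical, "Pearson distance" is assumed to mean 1 minus the Pearson correlation, the `MAX_CLUSTERS` stop is left out, and the medoid search simply stops when none of the sampled candidates is better. It assumes `inner > 0` and that no row is constant.

```python
import numpy as np


def normalize_rows(matrix):
    """Center each row and scale it to unit length, so that the dot product
    of two rows equals their Pearson correlation."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def sketch_cluster(matrix, inner, outer=None, moves=15, seed=0):
    """Return a {medoid_index: set(member_indices)} dict."""
    outer = inner if outer is None else outer
    rng = np.random.default_rng(seed)
    rows = normalize_rows(np.asarray(matrix, dtype=np.float64))
    remaining = np.arange(len(rows))  # indices of unclustered observations
    clusters = dict()

    while len(remaining) > 0:
        sub = rows[remaining]
        s = int(rng.integers(len(remaining)))  # (1) random seed observation
        while True:
            distances = 1 - sub @ sub[s]
            distances[s] = 0.0  # guard against rounding error
            inner_obs = np.flatnonzero(distances < inner)  # (2)
            n_sample = min(moves, len(inner_obs))
            candidates = rng.choice(inner_obs, size=n_sample, replace=False)  # (3)
            better = [i for i in candidates
                      if np.count_nonzero(1 - sub @ sub[i] < inner) > len(inner_obs)]
            if not better:
                break
            s = int(better[0])  # (4) move to a denser candidate, back to (2)
        outer_obs = np.flatnonzero(distances < outer)
        clusters[int(remaining[s])] = set(remaining[outer_obs].tolist())  # (5)
        remaining = np.delete(remaining, inner_obs)  # (6) repeat until empty
    return clusters


# Rows 0-2 are scaled copies of one profile, rows 3-4 of another
toy = np.array([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15],
                [5, 1, 4, 2, 3], [10, 2, 8, 4, 6]], dtype=float)
print(sketch_cluster(toy, inner=0.1))
```

Each pass of the inner loop strictly increases the number of observations within `inner` of the medoid, so the search always terminates.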

We have implemented the algorithm in two functions: 

* `vamb.cluster.cluster`, simply clusters a matrix, and so scales approximately O(n<sup>2</sup>).

* `vamb.cluster.tandemcluster` does some very rough preclustering and then clusters each precluster using `vamb.cluster.cluster`. Each observation is then assigned uniquely to the largest cluster it's a member of. This scales better with number of contigs, but accuracy is lost in the preclustering step.
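
The final step of `tandemcluster`, assigning each observation uniquely to the largest cluster it is a member of, can be sketched in a few lines. This is an illustration under assumptions, not Vamb's code: the name `assign_to_largest` is hypothetical, cluster size is taken to mean the size before any reassignment, and ties go to whichever cluster comes first in the dictionary.

```python
def assign_to_largest(clusters):
    """Make overlapping {name: set(elements)} clusters disjoint by giving
    each element to the largest cluster that contains it."""
    owner = dict()
    # Visit clusters from largest to smallest; the first claim wins
    by_size = sorted(clusters, key=lambda name: len(clusters[name]), reverse=True)
    for name in by_size:
        for element in clusters[name]:
            owner.setdefault(element, name)

    result = dict()
    for element, name in owner.items():
        result.setdefault(name, set()).add(element)
    return result


overlapping = {'a': {1, 2, 3}, 'b': {3, 4}}
print(assign_to_largest(overlapping))  # {'a': {1, 2, 3}, 'b': {4}}
```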

You can use the slow-but-accurate `vamb.cluster.cluster` with up to one or two million contigs depending on your patience, or ~10 million contigs if you're alright with running it for days.

The heavy lifting here is done in Numpy, so it might be worth making sure the BLAS library your Numpy is using is fast. You can check it with `numpy.__config__.show()` and if it says anything other than `NOT AVAILABLE` under the `mkl` or `openblas` entries, you're golden.

---

In [20]:
help(vamb.cluster.cluster)

Help on function cluster in module vamb.cluster:

cluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
    Iterative medoid cluster generator. Yields (medoid), set(labels) pairs.
    
    Inputs:
        matrix: A (obs x features) Numpy matrix of values
        labels: Numpy array with labels for matrix rows. None or 1-D array
        inner: Optimal medoid search within this distance from medoid
        outer: Radius of clusters extracted from medoid. If None, same as inner
        max_steps: Stop searching for optimal medoid after N futile attempts
        spearman: Use Spearman, not Pearson correlation
    
    Output: Generator of (medoid, set(labels_in_cluster)) tuples.



In [8]:
import numpy as np

labels = np.array(contignames)

# Unlike tandemcluster, which outputs the dictionary directly,
# the output of cluster is a generator
cluster_iterator = vamb.cluster.cluster(latent, labels, threshold)

clusters = dict()
for medoid, contigs in cluster_iterator:
    clusters[medoid] = contigs

In [7]:
# `medoid` and `contigs` still hold the last pair yielded in the loop above,
# so this cell must be run after the clustering cell
print(medoid)
print('Type of values:', type(contigs))

## Step six: Postprocessing the clusters

We haven't written any postprocessing modules because how to postprocess really depends on what you're looking for in your data.

One of the greatest weaknesses of Vamb - probably of metagenomic binners in general - is that the bins tend to be highly fragmented. You'll have lots of tiny bins, some of which are legitimate (viruses, plasmids), but most are parts of larger genomes that didn't get binned properly.

Here, let's say we're only interested in bacteria. So we throw away all bins with fewer than 250,000 basepairs.

---

In [9]:
# First let's make a contignames: length dict
lengthof = dict(zip(contignames, lengths))

# Now filter away the small bins
filtered_bins = dict()

for medoid, contigs in clusters.items():
    binsize = sum(lengthof[contig] for contig in contigs)
    
    if binsize >= 250000:
        filtered_bins[medoid] = contigs

In [10]:
print('Number of bins before filtering:', len(clusters))
print('Number of bins after filtering:', len(filtered_bins))

Number of bins before filtering: 6641
Number of bins after filtering: 113


---
Now, let's print them. For this we will use two writer functions:

1) `vamb.cluster.writeclusters`, that writes which clusters contain which contigs to a simple tab-separated file, and

2) `vamb.vambtools.writebins`, that writes FASTA files corresponding to each of the bins to a directory.

We will need to load all the contigs belonging to any bin into memory to use `vamb.vambtools.writebins`. If your bins don't fit in memory, sorry, you gotta find another way to make those FASTA bins.

---

In [12]:
with open('/home/jakni/Downloads/example/bins.tsv', 'w') as file:
    vamb.cluster.writeclusters(file, filtered_bins)

In [3]:
# Only keep contigs in any filtered bin in memory
allcontigs = set.union(*filtered_bins.values())

with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as file:
    fastadict = vamb.vambtools.loadfasta(file, keep=allcontigs)
    
vamb.vambtools.writebins('/home/jakni/Downloads/example/bins/', filtered_bins, fastadict)

## (If you have a reference: Benchmark the output)

For this to make any sense, you need to have a *reference*, that is, a list of bins that are deemed true and complete.

The reference could be a {clustername: set(contigs)} dict along with a {contigname: length} dict, just like the `clusters` and `lengthof` we made. It could also be a tab-separated file with (clustername, contigname, length)-rows, one row per contig.

Now, I have no reference for this dataset, so I created a reference file completely randomly:

In [26]:
!head /home/jakni/Downloads/example/reference.tsv

# binname contigname length
0	s198_NODE_2960_length_5085_cov_5.30505	5085
0	s30_NODE_9489_length_2530_cov_2.23365	2530
0	s179_NODE_2638_length_5642_cov_2.52661	5642
0	s30_NODE_160_length_42890_cov_12.914	42890
0	s198_NODE_4819_length_3620_cov_4.35672	3620
0	s178_NODE_2065_length_4779_cov_4.00513	4779
0	s198_NODE_1167_length_8851_cov_6.04376	8851
0	s198_NODE_7205_length_2698_cov_5.10081	2698
0	s198_NODE_5233_length_3401_cov_5.00303	3401


---
We of course expect the benchmark to show we have at most a handful of very incomplete bins, since the reference is random.

---

In [27]:
reference_path = '/home/jakni/Downloads/example/reference.tsv'

with open(reference_path) as filehandle:
    reference = vamb.benchmark.Reference.fromfile(filehandle)

---
We also need to instantiate the Observed bins (which we created above!), and a BenchMarkResult:

---

In [31]:
# We could also do this from the bins.tsv we created in the previous section,
# but here we have the dictionary with bins in memory already.
observed = vamb.benchmark.Observed(filtered_bins, reference)

# Keyword-only arguments to make sure you don't accidentally swap them around.
# It'll raise an error if you use non-keyword arguments.
result = vamb.benchmark.BenchMarkResult(reference=reference, observed=observed)

In [42]:
# Okay, how did we do?
result.printmatrix()

	Recall
Prec.	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.95
0.7	0	0	0	0	0	0	0	0
0.8	0	0	0	0	0	0	0	0
0.9	0	0	0	0	0	0	0	0
0.95	0	0	0	0	0	0	0	0
0.99	0	0	0	0	0	0	0	0


---
For each combination of recall and precision thresholds, this matrix shows the number of reference bins for which at least one observed bin passes both thresholds.

As expected (because the reference was randomly generated), the results are terrible - in fact, they couldn't be worse.

To check what else the BenchMarkResult measures, check `help(vamb.benchmark.BenchMarkResult)`
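
To make the meaning of one cell in that matrix concrete, here is a minimal sketch of how such a count could be computed from the kinds of dictionaries used in this tutorial. It is an assumption, not the implementation in `vamb.benchmark`: precision and recall are weighted by basepairs here, all bins are assumed non-empty, and the function names are hypothetical.

```python
def precision_recall(observed_bin, reference_bin, lengthof):
    """Basepair-weighted precision and recall of one observed bin
    against one reference bin (both are sets of contig names)."""
    shared = sum(lengthof[c] for c in observed_bin & reference_bin)
    observed_size = sum(lengthof[c] for c in observed_bin)
    reference_size = sum(lengthof[c] for c in reference_bin)
    return shared / observed_size, shared / reference_size


def count_recovered(observed, reference, lengthof, min_precision, min_recall):
    """Number of reference bins matched by at least one observed bin
    that passes both thresholds."""
    recovered = 0
    for reference_bin in reference.values():
        for observed_bin in observed.values():
            precision, recall = precision_recall(observed_bin, reference_bin, lengthof)
            if precision >= min_precision and recall >= min_recall:
                recovered += 1
                break
    return recovered


toy_lengths = {'a': 100, 'b': 300, 'c': 600}
toy_observed = {'bin1': {'a', 'b'}}
toy_reference = {'genome1': {'b', 'c'}}
print(precision_recall(toy_observed['bin1'], toy_reference['genome1'], toy_lengths))
print(count_recovered(toy_observed, toy_reference, toy_lengths, 0.7, 0.3))  # 1
```

In the toy example the bins share 300 bp, so precision is 300/400 and recall is 300/900.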

---

## Now what?

Here are some ideas:

__Quality control your bins with CheckM__

CheckM tries to assess the contamination and completeness of the given bins and works pretty well. It also tries to classify them taxonomically... with more limited success.

__Improve the bins with Stranglerfig or RefineM__

These tools inspect your bins and refine them by reassigning contigs between bins.