# Installation

The first step is to get Vamb on your computer.

__If you have `git` installed__:

    [jakni@nissen:Downloads]$ # Clone Vamb from GitHub into Downloads/vamb

    [jakni@nissen:Downloads]$ git clone https://github.com/jakobnissen/vamb vamb
    
__If you don't__:

    [jakni@nissen:Downloads]$ # You then presumably have access to a Vamb directory

    [jakni@nissen:Downloads]$ cp -r /path/to/vamb/directory vamb
    
---
Now you have Vamb on your computer. Time to get it imported:
    
    [jakni@nissen:~]$ python
    >>> import vamb
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'vamb'
    >>> # We're not in the directory containing the vamb directory.
    >>> # That means the directory containing the vamb dir is not in our sys.path.
    >>> # Either move the vamb directory to one of your sys.path dirs
    >>> # or add the vamb directory to sys.path. We'll do the latter.
    >>> import sys
    >>> sys.path.append('/home/jakni/Downloads')
    >>> import vamb
    >>>
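
If you run this setup more than once per session (say, from a notebook cell you keep re-executing), it can be handy to wrap it in a small helper. The following is a minimal sketch, not part of Vamb: the function name `add_parent_to_sys_path` is made up for this example, and the path is just the one used in this tutorial.

```python
import sys
from pathlib import Path


def add_parent_to_sys_path(package_dir):
    """Make the directory *containing* package_dir importable.

    package_dir is the path to the vamb directory itself. Returns the
    parent directory, which is what needs to be in sys.path.
    """
    parent = str(Path(package_dir).expanduser().resolve().parent)
    # Avoid adding the same entry twice if this is run again
    if parent not in sys.path:
        sys.path.append(parent)
    return parent


# Example with the location used in this tutorial; adjust to your own setup
parent = add_parent_to_sys_path('/home/jakni/Downloads/vamb')
print(parent in sys.path)
```

Note that it is the parent of the `vamb` directory that goes into `sys.path`, not the `vamb` directory itself.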

# Getting help

You'll almost certainly need help when using Vamb (we wish it were so easy you didn't, but making user-friendly software is hard!).

Luckily, there's the built-in `help` function in Python.

---

`>>> help(vamb)`
    
    Help on package vamb:

    NAME
        vamb - Variational Autoencoder for Metagenomic Binning

    DESCRIPTION
        Vamb does what it says on the tin - bins metagenomes using a variational autoencoder.
        
    [ lines elided ]
    
        General workflow:
        1) Filter contigs by size using vamb.filtercontigs
        2) Calculate TNF using vamb.parsecontigs
        3) Create RPKM table using vamb.parsebam
        4) Train autoencoder using vamb.encode
        5) Cluster latent representation using vamb.cluster
    
    [ lines elided ]
    
---
    
The `PACKAGE CONTENTS` under `help(vamb)` is just a list of all importable files in the `vamb` directory - some of these really shouldn't be imported, so ignore that.

---
You can also get help for the modules:

`>>> help(vamb.cluster)`

    Help on module vamb.cluster in vamb:

    NAME
        vamb.cluster - Iterative medoid clustering of Numpy arrays.

    DESCRIPTION
        Implements two core functions: cluster and tandemcluster, along with the helper
        functions writeclusters and readclusters.
        For all functions in this module, a collection of clusters are represented as
        a {clustername, set(elements)} dict.

        Clustering algorithm:
    
    [ lines elided ]
        
---
And for functions:

`>>> help(vamb.cluster.tandemcluster)`

    Help on function tandemcluster in module vamb.cluster:

    tandemcluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
        Splits the datasets, then clusters each partition before merging
        the resulting clusters. This is faster, especially on larger datasets, but
        less accurate than normal clustering.

        Inputs:
            matrix: A (obs x features) Numpy matrix of values
            labels: Numpy array with labels for matrix rows. None or 1-D array
            inner: Optimal medoid search within this distance from medoid
            outer: Radius of clusters extracted from medoid. If None, same as inner
            max_steps: Stop searching for optimal medoid after N futile attempts
            spearman: Use Spearman, not Pearson correlation

        Output: {medoid: set(labels_in_cluster) dictionary}

# A simple workflow example from within Python

You begin with some FASTQ files from, say, 6 samples. First you do the following steps:

1) Preprocess the reads and check their quality

2) Assemble each sample individually OR co-assemble and get the contigs out

3) Concatenate the FASTA files together while making sure all contig headers stay unique
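
One simple way to keep headers unique in step 3 is to prefix each header with the name of the sample it came from. Here is a minimal sketch using only the standard library. The function names and the `<sample>_<header>` naming scheme are assumptions for illustration, not something Vamb requires.

```python
def prefix_headers(lines, sample):
    """Yield FASTA lines with '<sample>_' prepended to every header."""
    for line in lines:
        if line.startswith('>'):
            yield '>' + sample + '_' + line[1:]
        else:
            yield line


def concatenate_fastas(samples):
    """samples: {sample_name: iterable of FASTA lines}. Returns a list of lines.

    Raises ValueError if two headers are still identical after prefixing.
    """
    seen = set()
    merged = []
    for sample, lines in samples.items():
        for line in prefix_headers(lines, sample):
            if line.startswith('>'):
                header = line[1:].split()[0]
                if header in seen:
                    raise ValueError('Duplicate header: ' + header)
                seen.add(header)
            merged.append(line)
    return merged


# Two samples whose assemblers both produced a contig called NODE_1
demo = {
    's30': ['>NODE_1\n', 'ACGT\n'],
    's101': ['>NODE_1\n', 'GGCC\n'],
}
merged = concatenate_fastas(demo)
print(''.join(merged), end='')
```

With real files you would pass open file handles instead of the small lists used in the demo.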

Now, like other metagenomic binners of contigs, Vamb relies on two properties of the contigs: The abundance of the contigs in each sample and the kmer-composition of the contigs. The observed values for both of these measures become uncertain when the contig is too small, so you should filter the small contigs away:

4) Remove all small contigs from the FASTA file (say, less than 2000 bp in length)
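
If you don't have a favourite tool for step 4, the filtering is easy enough to sketch in plain Python. This is a hypothetical stand-alone helper, not the `vamb.filtercontigs` module mentioned in the help text. The 2000 bp default just mirrors the number suggested above.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Don't forget the last record
    if header is not None:
        yield header, ''.join(chunks)


def filter_by_length(lines, minlength=2000):
    """Yield only the (header, sequence) records with at least minlength bases."""
    for header, sequence in read_fasta(lines):
        if len(sequence) >= minlength:
            yield header, sequence


# One 2500 bp contig split over two lines, and one 4 bp contig
demo_lines = ['>long', 'A' * 1500, 'C' * 1000, '>short', 'ACGT']
kept = list(filter_by_length(demo_lines))
print([header for header, sequence in kept])
```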

To estimate the abundance of the contigs, you need to map the reads from each sample to the FASTA file. When using BWA, don't filter for improperly paired reads or minimum alignment score.

5) Map the reads to the FASTA file to obtain 6 .bam files

Now, maybe you can't filter the FASTA file. Maybe you have already spent tonnes of time getting those BAM files and you're not going to remap even if your life depended on it. Maybe your FASTA file contains genes, not contigs, and so removing all entries shorter than e.g. 2000 bp is a bit too much to ask.

That's fair enough. You can ignore any contigs shorter than a certain length with the `minlength` keyword argument of the `read_contigs` and the `read_bamfiles` functions (shown later). This is not ideal, since the smaller, ignored contigs will still have recruited some reads during mapping, and those reads are then not mapped to the larger contigs.
___

This gives us the following results, here put in `/home/jakni/Downloads/example`:

* `contigs.fna` - The filtered FASTA contigs which were mapped against, and
* `bamfiles/*.bam` - The 6 .bam files from mapping the reads to the contigs above.



In [11]:
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

## First, parse the FASTA file

How was it I did that again? Oh, right, the `read_contigs` function in the `parsecontigs` module.

In [12]:
help(vamb.parsecontigs.read_contigs)

Help on function read_contigs in module vamb.parsecontigs:

read_contigs(byte_iterator, minlength=100)
    Parses a FASTA file open in binary reading mode.
    
    Input:
        byte_iterator: Iterator of binary lines of a FASTA file
        minlength[100]: Ignore any references shorter than N bases 
    
    Outputs:
        tnfs: A (n_FASTA_entries x 136) matrix of tetranucleotide freq.
        contignames: A list of contig headers
        lengths: A list of contig lengths



In [24]:
# Open the file in binary mode - you can use the vamb.vambtools.Reader to read
# from normal or gzipped file seamlessly. Here I just use the open function
with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as filehandle:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(filehandle)

In [25]:
# Let's have a look at the resulting data

print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 10 elements of contignames:')
for i in range(10):
    print(contignames[i])
    
print('Type of lengths:', type(lengths))
print('Length of lengths:', len(lengths), end='\n\n')

print('First 10 elements of lengths:')
for i in range(10):
    print(lengths[i])

Type of tnfs: <class 'numpy.ndarray'> of dtype float32
Shape of tnfs: (39551, 136)

Type of contignames: <class 'list'>
Length of contignames: 39551

First 10 elements of contignames:
s30_NODE_1_length_245508_cov_18.4904
s30_NODE_2_length_222690_cov_39.7685
s30_NODE_3_length_222459_cov_20.3665
s30_NODE_4_length_173155_cov_20.1181
s30_NODE_5_length_161239_cov_20.1237
s30_NODE_6_length_157102_cov_20.734
s30_NODE_7_length_156768_cov_44.8078
s30_NODE_8_length_152691_cov_19.6759
s30_NODE_9_length_121154_cov_21.6491
s30_NODE_10_length_119726_cov_136.834
Type of lengths: <class 'list'>
Length of lengths: 39551

First 10 elements of lengths:
245508
222690
222459
173155
161239
157102
156768
152691
121154
119726


---
`tnfs` holds the tetranucleotide frequencies: for each contig, the frequency of each 4mer, counted as its canonical kmer. The matrix is normalized across samples such that the frequency of e.g. 'AGGC' is measured relative to other contigs.
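
If you wonder where the 136 columns come from: a 4mer and its reverse complement are counted as the same thing, and there are exactly 136 such canonical 4mers. The sketch below checks that number. It assumes one common definition of "canonical" (the lexicographically smaller of a kmer and its reverse complement); which representative Vamb picks internally is not shown in this tutorial.

```python
from itertools import product

COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def canonical(kmer):
    """Return the lexicographically smaller of kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# Collapse all 4**4 = 256 possible 4mers to their canonical form
canonical_4mers = sorted({canonical(''.join(p)) for p in product('ACGT', repeat=4)})

print(len(canonical_4mers))  # 136
print(canonical('GCCT'))     # AGGC
```

The count works out because 16 of the 256 4mers are their own reverse complement, and the remaining 240 pair up into 120, giving 16 + 120 = 136.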

Here, you should probably consider whether you can keep everything in memory. If not, all the relevant modules have reading and writing functions so you can dump the results to disk and delete them from memory. This is a small dataset, so there's no problem. With hundreds of samples and millions of contigs, this becomes a problem, even though Vamb is fairly memory-friendly.

As a rule of thumb, the memory consumption for the most memory intensive step is approximately 8 \* (n_samples + 136) \* n_contigs bytes plus a little bit of overhead. If this fits snugly in your RAM with some room to spare, don't worry about the RAM.

In my example, I have 6 samples and 39551 contigs for a total memory usage of ~45 MB.
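
The rule of thumb is easy to turn into a tiny function so you can plug in your own numbers. This is just the formula above written out, with a made-up function name. It does not include the "little bit of overhead".

```python
def estimate_peak_bytes(n_samples, n_contigs, n_tnf=136, bytes_per_value=8):
    """Rule-of-thumb memory use of the most memory-intensive step, in bytes."""
    return bytes_per_value * (n_samples + n_tnf) * n_contigs


# The example dataset from this tutorial: 6 samples, 39551 contigs
n_bytes = estimate_peak_bytes(n_samples=6, n_contigs=39551)
print(f'{n_bytes / 1e6:.1f} MB')  # 44.9 MB
```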

---

## Parsing the BAM files

In [9]:
help(vamb.parsebam.read_bamfiles)

Help on function read_bamfiles in module vamb.parsebam:

read_bamfiles(paths, minscore=50, minlength=2000, processors=4)
    Spawns processes to parse BAM files and get contig rpkms.
    
    Input:
        path: Path to BAM file
        minscore [50]: Minimum alignment score (AS field) to consider
        minlength [2000]: Discard any references shorter than N bases 
        processors [all]: Number of processes to spawn
    
    Outputs:
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        contignames: A list of contignames from first BAM header



---
We can see (in the default value for the `processors` argument) that the function detects 4 CPUs on my laptop. That means we can read 4 BAM files at a time. Each BAM file being read takes up some memory, but unless you have a machine with tonnes of CPUs and little RAM, that's not going to be an issue. In either case, it's probably going to be IO bound when we give it 8+ cores, so that's unlikely to become an issue.

This step is not super optimized, but the largest file is just 1.8 GB, so it takes a few minutes.

---

In [17]:
bamfiles = !ls /home/jakni/Downloads/example/bamfiles

In [18]:
bamfiles = ['/home/jakni/Downloads/example/bamfiles/' + p for p in bamfiles]
bamfiles

['/home/jakni/Downloads/example/bamfiles/e101.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e178.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e179.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e196.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e198.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e30.filtered.bam']
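
The `!ls` trick only works inside IPython/Jupyter. In a plain Python script you can get the same list with the standard library. A minimal sketch follows; the helper name `list_bamfiles` is made up, and the directory is the one used in this tutorial.

```python
import glob
import os


def list_bamfiles(directory):
    """Return the sorted full paths of all .bam files directly in directory."""
    pattern = os.path.join(directory, '*.bam')
    return sorted(glob.glob(pattern))


# Returns an empty list if the directory does not exist or has no .bam files
bamfiles = list_bamfiles('/home/jakni/Downloads/example/bamfiles')
print(len(bamfiles), 'BAM files found')
```

Sorting makes the column order of the depths table reproducible between runs, which matters because the paths are later used as column names.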

In [20]:
# That looks right.

# We already have the contignames, so who cares about saving that
sample_rpkms, _ = vamb.parsebam.read_bamfiles(bamfiles)
del _

In [9]:
print('Type of sample_rpkms:', type(sample_rpkms))

print('Content of the dict:')
print('First key:', next(iter(sample_rpkms.keys())))
print('First value:', next(iter(sample_rpkms.values())))

Type of sample_rpkms: <class 'dict'>
Content of the dict:
First key: /home/jakni/Downloads/example/bamfiles/e101.filtered.bam
First value: [0.11535063 0.17579393 0.58409214 ... 0.         0.         0.        ]


---
This is the depths (RPKM) table, here in the form of a dict with one item per column of the table.

We need to convert this to a Numpy array to feed it to the VAE.

This is where RAM might be an issue, as you'll have two copies of the depths table in memory. To get around that, you can use the `write_rpkms` and `array_fromnpz` functions to write to disk, then read it back in as a single matrix.

---

In [62]:
help(vamb.parsebam.array_from_sample_rpkms)

Help on function array_from_sample_rpkms in module vamb.parsebam:

array_from_sample_rpkms(sample_rpkms, columns)
    Creates a (n-contigs x n-bamfiles) array from a sample_rpkms
    (expected to be a {path: Numpy-32-float-RPKM} dictionary)
    
    Inputs: 
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        columns: Names of BAM file paths in sample_rpkms object in correct order
        
    Output: A (n-contigs x n-bamfiles) array with each column being the corre-
    sponding array from sample_rpkms, normalized so each row sums to 1



In [24]:
rpkms = vamb.parsebam.array_from_sample_rpkms(sample_rpkms, bamfiles)
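
To make it concrete what that conversion does, here is a rough NumPy sketch of the behaviour described in the help text: stack the per-file arrays as columns, then normalize so each row sums to 1. This is not Vamb's implementation. In particular, leaving all-zero rows as zeros is an assumption made here to avoid dividing by zero.

```python
import numpy as np


def rpkm_dict_to_array(sample_rpkms, columns):
    """Build a (n_contigs x n_bamfiles) array with rows normalized to sum to 1."""
    # One column per BAM file, in the order given by `columns`
    array = np.column_stack([sample_rpkms[path] for path in columns])
    array = array.astype(np.float32)
    # Normalize each row (contig); all-zero rows are left as zeros (assumption)
    rowsums = array.sum(axis=1, keepdims=True)
    rowsums[rowsums == 0] = 1.0
    return array / rowsums


# Toy input: 3 contigs, 2 BAM files
demo = {
    'a.bam': np.array([1.0, 0.0, 2.0], dtype=np.float32),
    'b.bam': np.array([3.0, 0.0, 2.0], dtype=np.float32),
}
demo_array = rpkm_dict_to_array(demo, ['a.bam', 'b.bam'])
print(demo_array)
```

The `columns` argument is what fixes the column order, since you should not rely on the order of the dict itself.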

---
Now, I tend to be a bit ~~paranoid~~<sup>careful</sup>, so if I loaded in 500 GB of BAM files, I'd want to save the work I have now in case something goes wrong - and we're about to fire up the VAE so lots of things can go wrong.

What important objects do I have in memory?

* contignames: A list of contignames
* lengths: A list of contig lengths
* rpkms: A numpy array of rpkms
* tnfs: A numpy array of tnfs

I'm going to use `pickle` to save the Python lists and `vamb.vambtools.write_npz` to save the Numpy arrays (the latter is just a wrapper for `numpy.savez_compressed`). Of course, I could have used pickle for it all.

---

In [25]:
import pickle

with open('/home/jakni/Downloads/example/contignames.pickle', 'wb') as file:
    pickle.dump(contignames, file, protocol=4)

with open('/home/jakni/Downloads/example/lengths.pickle', 'wb') as file:
    pickle.dump(lengths, file, protocol=4)

with open('/home/jakni/Downloads/example/tnfs.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, tnfs)
    
with open('/home/jakni/Downloads/example/rpkms.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, rpkms)
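
Saving is only half of being careful; you also want to know you can get the data back. For the pickled lists, that is a plain `pickle.load`. The sketch below does a round trip in a temporary directory with stand-in data, and the helper names are made up for this example. Reading the `.npz` files back is not covered here, since that depends on how `write_npz` names the array inside the file.

```python
import os
import pickle
import tempfile


def save_list(path, obj):
    """Pickle obj to path, using the same protocol as in this tutorial."""
    with open(path, 'wb') as file:
        pickle.dump(obj, file, protocol=4)


def load_list(path):
    """Load a pickled object back from path."""
    with open(path, 'rb') as file:
        return pickle.load(file)


# Round trip with a few of the contig lengths from above as stand-in data
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'lengths.pickle')
    save_list(path, [245508, 222690, 222459])
    restored = load_list(path)

print(restored)
```

Only unpickle files you wrote yourself: `pickle.load` can execute arbitrary code from a malicious file.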

## Now we need to train the autoencoder

Again, you can use `help` to see how to use the module

`help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call the `help` functions on those, but I'm not showing that here.

In [5]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, verbose=True)

Epoch: 1	Loss: 2.4766	BCE: 2.4632	MSE: 0.00394	KLD: 0.0094
Epoch: 2	Loss: 2.2839	BCE: 2.2691	MSE: 0.00599	KLD: 0.0088
Epoch: 3	Loss: 2.1273	BCE: 2.1102	MSE: 0.00837	KLD: 0.0087
Epoch: 4	Loss: 1.9752	BCE: 1.9554	MSE: 0.01141	KLD: 0.0083
Epoch: 5	Loss: 1.8483	BCE: 1.8260	MSE: 0.01442	KLD: 0.0079


---
We can see that the mean squared error (MSE, which is the TNF-related loss) rises during these first 5 epochs, presumably because the VAE sacrifices an efficient representation of the TNF in order to learn the depths (whose loss is BCE) better. This is quite expected: the MSE loss is typically 2-3 orders of magnitude smaller than BCE exactly so that the VAE will make this choice.

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [6]:
latent = vae.encode(dataloader)

print(latent.shape)

(39551, 40)


---
That's 39551 contigs each represented by the value of 40 latent neurons.

Now we need to cluster this. But first, we must determine a proper clustering threshold.

---

## Determining the clustering threshold

(This part will be added when it's stable.)

In [7]:
threshold = 0.03

## Clustering the latent representation

There are two clustering algorithms: `vamb.cluster.cluster`, an accurate one which scales badly (quadratically) on large datasets (up to one or two million contigs is alright, depending on your patience), and `vamb.cluster.tandemcluster`, a less accurate one which scales better.

The heavy lifting here is done in Numpy, so it might be worth making sure the BLAS library your Numpy is using is fast. You can check it with `numpy.__config__.show()` and if it says something with `mkl` or `openblas`, you're golden.

I have a measly 40k contigs, so I'm obviously going for the slow but accurate function.

---

In [20]:
help(vamb.cluster.cluster)

Help on function cluster in module vamb.cluster:

cluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
    Iterative medoid cluster generator. Yields (medoid), set(labels) pairs.
    
    Inputs:
        matrix: A (obs x features) Numpy matrix of values
        labels: Numpy array with labels for matrix rows. None or 1-D array
        inner: Optimal medoid search within this distance from medoid
        outer: Radius of clusters extracted from medoid. If None, same as inner
        max_steps: Stop searching for optimal medoid after N futile attempts
        spearman: Use Spearman, not Pearson correlation
    
    Output: Generator of (medoid, set(labels_in_cluster)) tuples.



In [8]:
import numpy as np

labels = np.array(contignames)
cluster_iterator = vamb.cluster.cluster(latent, labels, threshold)

clusters = dict()
for medoid, contigs in cluster_iterator:
    clusters[medoid] = contigs

## Postprocessing the clusters

This is not automatic, because how to do it really depends on what you're looking for in your data.

One of the greatest weaknesses of Vamb is that the bins tend to be highly fragmented. You'll have lots of tiny bins, some of which are legitimate (viruses, plasmids), but most are parts of larger genomes that didn't get binned properly.

Here, let's say we're only interested in bacteria. So we throw away all bins with fewer than 250,000 basepairs.

---

In [9]:
# First let's make a contignames: length dict
lengthof = dict(zip(contignames, lengths))

# Now filter away the small bins
filtered_bins = dict()

for medoid, contigs in clusters.items():
    binsize = sum(lengthof[contig] for contig in contigs)
    
    if binsize >= 250000:
        filtered_bins[medoid] = contigs

In [10]:
print('Number of bins before filtering:', len(clusters))
print('Number of bins after filtering:', len(filtered_bins))

Number of bins before filtering: 6641
Number of bins after filtering: 113


---
Now, let's print them. For this we will use two writer functions:

1) `vamb.cluster.writeclusters`, which writes which clusters contain which contigs to a simple tab-separated file, and

2) `vamb.vambtools.writebins`, which writes FASTA files corresponding to each of the bins to a directory.

We will need to load all the contigs belonging to any bin into memory to use `vamb.vambtools.writebins`. If your bins don't fit in memory, you gotta find another way to make those FASTA bins.

---

In [12]:
with open('/home/jakni/Downloads/example/bins.tsv', 'w') as file:
    vamb.cluster.writeclusters(file, filtered_bins)

In [3]:
# Only keep contigs in any filtered bin in memory
allcontigs = set.union(*filtered_bins.values())

with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as file:
    fastadict = vamb.vambtools.loadfasta(file, keep=allcontigs)
    
vamb.vambtools.writebins('/home/jakni/Downloads/example/bins/', filtered_bins, fastadict)

## (If you have a reference: Benchmark the output)

For this to make any sense, you need to have a *reference*, that is, a list of bins that are deemed true and complete.

The reference could be a {clustername: set(contigs)} dict along with a {contigname: length} dict, just like the `clusters` and `lengthof` we made. It could also be a tab-separated file with clustername, contigname, length-rows, one row per contig.
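
To show how the two forms relate, here is a minimal sketch that turns such a tab-separated file into the two dicts. It is a hypothetical helper, not the parser Vamb uses. Skipping lines that start with `#` is an assumption, made so that a commented header line does not break it.

```python
def parse_reference(lines):
    """Parse 'binname<TAB>contigname<TAB>length' rows.

    Returns ({binname: set(contignames)}, {contigname: length}).
    """
    bins = dict()
    lengthof = dict()
    for line in lines:
        line = line.rstrip('\n')
        # Skip empty lines and comment/header lines (assumption)
        if not line or line.startswith('#'):
            continue
        binname, contigname, length = line.split('\t')
        bins.setdefault(binname, set()).add(contigname)
        lengthof[contigname] = int(length)
    return bins, lengthof


demo_lines = [
    '# binname contigname length\n',
    '0\ts198_NODE_2960_length_5085_cov_5.30505\t5085\n',
    '0\ts30_NODE_9489_length_2530_cov_2.23365\t2530\n',
]
demo_bins, demo_lengthof = parse_reference(demo_lines)
print(demo_bins)
print(demo_lengthof)
```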

Now, I have no reference for this dataset, so I created a reference file completely randomly:

In [26]:
!head /home/jakni/Downloads/example/reference.tsv

# binname contigname length
0	s198_NODE_2960_length_5085_cov_5.30505	5085
0	s30_NODE_9489_length_2530_cov_2.23365	2530
0	s179_NODE_2638_length_5642_cov_2.52661	5642
0	s30_NODE_160_length_42890_cov_12.914	42890
0	s198_NODE_4819_length_3620_cov_4.35672	3620
0	s178_NODE_2065_length_4779_cov_4.00513	4779
0	s198_NODE_1167_length_8851_cov_6.04376	8851
0	s198_NODE_7205_length_2698_cov_5.10081	2698
0	s198_NODE_5233_length_3401_cov_5.00303	3401


---
We of course expect the benchmark to show we have at most a handful of very incomplete bins, since the reference is random.

---

In [27]:
reference_path = '/home/jakni/Downloads/example/reference.tsv'

with open(reference_path) as filehandle:
    reference = vamb.benchmark.Reference.fromfile(filehandle)

---
We also need to instantiate the Observed bins (which we created above!), and a BenchMarkResult

---

In [31]:
# We could also do this from a file, but here we have the dictionary at hand
observed = vamb.benchmark.Observed(filtered_bins, reference)

# Keyword-only arguments to make sure you don't accidentally swap them around.
# It'll raise an error if you use non-keyword arguments.
result = vamb.benchmark.BenchMarkResult(reference=reference, observed=observed)

In [42]:
# Okay, how did we do?
result.printmatrix()

	Recall
Prec.	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.95
0.7	0	0	0	0	0	0	0	0
0.8	0	0	0	0	0	0	0	0
0.9	0	0	0	0	0	0	0	0
0.95	0	0	0	0	0	0	0	0
0.99	0	0	0	0	0	0	0	0


---
As expected (because the reference was randomly generated), the results are terrible - in fact, they couldn't be worse.

To check what else the BenchMarkResult measures, check `help(vamb.benchmark.BenchMarkResult)`

---

## Now what?

Here are some ideas:

__Quality control your bins with CheckM__

CheckM tries to assess the contamination and completeness of the given bins. It also tries to classify them... with more limited success.

__Improve the bins with Stranglerfig or RefineM__

These tools inspect your bins and refine them by reassigning contigs between bins.