# Installation

The first step is to get Vamb onto your computer.

__If you have `git` installed__:

    [jakni@nissen:Downloads]$ # Clone Vamb from GitHub into Downloads/vamb
    [jakni@nissen:Downloads]$ git clone https://github.com/jakobnissen/vamb vamb
    
__If you don't__:

    [jakni@nissen:Downloads]$ # You then presumably have access to a Vamb directory
    [jakni@nissen:Downloads]$ cp -r /path/to/vamb/directory vamb
    
---
Now you have Vamb on your computer. Time to get it imported:
    
    [jakni@nissen:~]$ python
    >>> import vamb
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'vamb'
    >>> # We're not in the directory containing the vamb directory.
    >>> # That means the directory containing the vamb dir is not in our sys.path.
    >>> # Either move the vamb directory to one of your sys.path dirs
    >>> # or add the directory containing it to sys.path. We'll do the latter.
    >>> import sys
    >>> sys.path.append('/home/jakni/Downloads')
    >>> import vamb
    >>>
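
If you'd rather not edit `sys.path` by hand in every session, the check-then-append logic above can be wrapped in a small helper. This is a minimal sketch using only the standard library; `ensure_importable` is a hypothetical name and not part of Vamb.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def ensure_importable(package_name, parent_dir):
    """Make package_name importable by adding parent_dir to sys.path.

    parent_dir is the directory that CONTAINS the package directory,
    e.g. '/home/jakni/Downloads' if Vamb lives in '/home/jakni/Downloads/vamb'.
    Returns True if sys.path had to be touched, False if the package
    could already be found.
    """
    if importlib.util.find_spec(package_name) is not None:
        return False

    parent = str(Path(parent_dir).expanduser())
    if parent not in sys.path:
        sys.path.append(parent)

    # Make sure the import system notices directories created after startup.
    importlib.invalidate_caches()
    return True


# Example (path taken from the session above):
# ensure_importable('vamb', '/home/jakni/Downloads')
# import vamb
```

Note that the helper only extends the search path; if the directory does not actually contain the package, the subsequent `import` will still fail.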

# Getting help

You'll almost certainly need help when using Vamb (we wish it were so easy that you didn't, but making user-friendly software is hard!).

Luckily, there's the built-in `help` function in Python.

---

`>>> help(vamb)`
    
    Help on package vamb:

    NAME
        vamb - Vamb - Variational Autoencoder for Metagenomic Binning

    DESCRIPTION
        Vamb does what it says on the tin - bins metagenomes using a variational autoencoder.
        
    [ ... ]
    
---
    
The `PACKAGE CONTENTS` section is just a list of all importable files in the `vamb` directory. Some of these really shouldn't be imported, so ignore that section.

---
You can also get help for the modules:

`>>> help(vamb.cluster)`

    Help on module vamb.cluster in vamb:

    NAME
        vamb.cluster - Iterative medoid clustering of Numpy arrays.

    DESCRIPTION
        Implements two core functions: cluster and tandemcluster, along with the helper
        functions writeclusters and readclusters.
        For all functions in this module, a collection of clusters are represented as
        a {clustername, set(elements)} dict.

        Clustering algorithm:
        [...]
        
---
And for functions:

`>>> help(vamb.cluster.tandemcluster)`

    Help on function tandemcluster in module vamb.cluster:

    tandemcluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
        Splits the datasets, then clusters each partition before merging
        the resulting clusters. This is faster, especially on larger datasets, but
        less accurate than normal clustering.

        Inputs:
            matrix: A (obs x features) Numpy matrix of values
            labels: Numpy array with labels for matrix rows. None or 1-D array
            inner: Optimal medoid search within this distance from medoid
            outer: Radius of clusters extracted from medoid. If None, same as inner
            max_steps: Stop searching for optimal medoid after N futile attempts
            spearman: Use Spearman, not Pearson correlation

        Output: {medoid: set(labels_in_cluster) dictionary}

# A simple workflow example

You begin with some FASTQ files from, say, 6 samples. First, do the following steps:

1) Preprocess the reads and check their quality

2) Assemble each sample individually OR co-assemble and get the contigs out

3) Concatenate the FASTA files together while making sure all contig headers stay unique

Now, like other metagenomic binners of contigs, Vamb relies on two properties of the contigs: The abundance of the contigs in each sample and the kmer-composition of the contigs. The observed values for both of these measures become uncertain when the contig is too small, so you need to filter the small contigs away:

4) Remove all small contigs from the FASTA file (say, less than 2000 bp in length)

To estimate the abundance of the contigs, you need to map the reads from each sample to the FASTA file. When using BWA, don't filter out improperly paired reads and don't filter on a minimum alignment score.

5) Map the reads to the FASTA file to obtain 6 .bam files
___

This gives us the following results, here put in `/home/jakni/Downloads/example`:

* `contigs.fna` - The filtered FASTA contigs which were mapped against, and
* `bamfiles/*.bam` - The 6 .bam files from mapping the reads to the contigs above.
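
Steps 3 and 4 above (concatenating the assemblies with unique headers, then dropping small contigs) can be sketched in plain Python. This is only an illustration under stated assumptions: `iter_fasta` and `merge_and_filter` are hypothetical helpers, not Vamb functions, and prefixing each header with the sample name is just one way to keep headers unique.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


def merge_and_filter(samples, minlength=2000):
    """Concatenate contigs from several samples and drop the short ones.

    samples: {sample_name: iterable of FASTA lines}
    Returns a list of (new_header, sequence) pairs, where new_header is
    the original header prefixed with the sample name.
    """
    kept = []
    seen = set()
    for sample, lines in samples.items():
        for header, sequence in iter_fasta(lines):
            if len(sequence) < minlength:
                continue
            new_header = f'{sample}_{header}'
            if new_header in seen:
                raise ValueError(f'Duplicate contig header: {new_header}')
            seen.add(new_header)
            kept.append((new_header, sequence))
    return kept
```

In practice `samples` would map each sample name to an open file handle, and the returned pairs would be written out as a single FASTA file before mapping.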



In [2]:
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

## First, parse the FASTA file

In [30]:
help(vamb.parsecontigs.read_contigs)

Help on function read_contigs in module vamb.parsecontigs:

read_contigs(contigpath, minlength=2000)
    Parses a FASTA file and produces a list of headers and a matrix of TNF.
    
    Input:
        contigpath: Path to a FASTA file with contigs
        min_length[2000]: Minimum length of contigs
    
    Outputs:
        contignames: A list of contig headers
        tnfs: A (n_FASTA_entries x 136) matrix of tetranucleotide freq.



In [97]:
contigpath = '/home/jakni/Downloads/example/contigs.fna'

# Open the file in binary mode - you can use the vamb.vambtools.Reader to read
# from normal or gzipped file seamlessly
with open(contigpath, 'rb') as filehandle:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(filehandle)

In [29]:
print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 10 elements of contignames:')
for i in range(10):
    print(contignames[i])

Type of tnfs: <class 'numpy.ndarray'> of dtype float32
Shape of tnfs: (39551, 136)

Type of contignames: <class 'list'>
Length of contignames: 39551

First 10 elements of contignames:
s30_NODE_1_length_245508_cov_18.4904
s30_NODE_2_length_222690_cov_39.7685
s30_NODE_3_length_222459_cov_20.3665
s30_NODE_4_length_173155_cov_20.1181
s30_NODE_5_length_161239_cov_20.1237
s30_NODE_6_length_157102_cov_20.734
s30_NODE_7_length_156768_cov_44.8078
s30_NODE_8_length_152691_cov_19.6759
s30_NODE_9_length_121154_cov_21.6491
s30_NODE_10_length_119726_cov_136.834


---
The tnfs is the tetranucleotide frequency: for each contig, the frequency of each canonical 4-mer, where a 4-mer and its reverse complement are counted together. The matrix is normalized across contigs, such that the frequency of e.g. 'AGGC' is measured relative to the other contigs.
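
To see where the 136 columns come from, here is a small standard-library sketch that counts canonical 4-mers in a single sequence. It is not Vamb's implementation: the names `canonical`, `CANONICAL_4MERS` and `tnf` are made up for this example, it picks the lexicographically smaller of a 4-mer and its reverse complement as the canonical form (an assumption), and it skips the normalization across contigs.

```python
from itertools import product

COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def canonical(kmer):
    """Return the lexicographically smaller of a kmer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


# All distinct canonical 4-mers, in a fixed (sorted) order.
CANONICAL_4MERS = sorted({canonical(''.join(p)) for p in product('ACGT', repeat=4)})


def tnf(sequence):
    """Return the canonical 4-mer frequencies of one sequence as a list of floats.

    4-mers containing anything other than A, C, G or T are skipped.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(CANONICAL_4MERS, 0)
    total = 0
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= set('ACGT'):
            counts[canonical(kmer)] += 1
            total += 1
    if total == 0:
        return [0.0] * len(CANONICAL_4MERS)
    return [counts[kmer] / total for kmer in CANONICAL_4MERS]


print(len(CANONICAL_4MERS))  # 136
```

Of the 256 possible 4-mers, 16 are their own reverse complement and the remaining 240 pair up, which gives 16 + 120 = 136 canonical 4-mers.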

Here, you should probably consider whether you can keep everything in memory. If not, most modules have reading and writing methods, so you can dump the results to disk and delete them from memory. This is a small dataset, so there's no problem. With hundreds of samples and millions of contigs, memory does become a problem, even though Vamb is fairly memory-friendly.

---

## Parsing the BAM files

In [32]:
help(vamb.parsebam.read_bamfiles)

Help on function read_bamfiles in module vamb.parsebam:

read_bamfiles(paths, minscore=50, minlength=2000, processors=4)
    Spawns processes to parse BAM files and get contig rpkms.
    
    Input:
        path: Path to BAM file
        minscore [50]: Minimum alignment score (AS field) to consider
        minlength [2000]: Discard any references shorter than N bases 
        processors [all]: Number of processes to spawn
    
    Outputs:
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        contignames: A list of contignames from first BAM header



---
We can see that the function detects 4 CPUs on my laptop, which means we can read 4 BAM files at a time. Each BAM file being read takes up some memory, but unless you have a machine with tonnes of CPUs and little RAM, that's not going to be an issue. In any case, reading will probably be IO bound once you give it 8 or more cores, so adding more won't help much.

This step is not super optimized, but the largest file is just 1.8 GB, so it only takes a few minutes.

---

In [40]:
bamfiles = !ls /home/jakni/Downloads/example/bamfiles

In [42]:
bamfiles = ['/home/jakni/Downloads/example/bamfiles/' + p for p in bamfiles]
bamfiles

['/home/jakni/Downloads/example/bamfiles/e101.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e178.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e179.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e196.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e198.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e30.filtered.bam']
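
The `!ls` syntax above only works in IPython or Jupyter. Outside a notebook, the same list of paths can be built with `pathlib` from the standard library. A minimal sketch, where `list_bamfiles` is a hypothetical helper name:

```python
from pathlib import Path


def list_bamfiles(directory):
    """Return the sorted paths of all .bam files directly inside directory."""
    return sorted(str(path) for path in Path(directory).glob('*.bam'))


# Example (directory taken from the cells above):
# bamfiles = list_bamfiles('/home/jakni/Downloads/example/bamfiles')
```

Sorting makes the order reproducible, which matters later because the order of the paths determines the order of the columns in the depths array.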

In [43]:
# That looks right.

# We already have the contignames, so who cares about saving that
sample_rpkms, _ = vamb.parsebam.read_bamfiles(bamfiles)
del _

In [61]:
print('Type of sample_rpkms:', type(sample_rpkms))

print('Content of the dict:')
print('First key:', next(iter(sample_rpkms.keys())))
print('First value:', next(iter(sample_rpkms.values())))

Type of sample_rpkms: <class 'dict'>
Content of the dict:
First key: /home/jakni/Downloads/example/bamfiles/e101.filtered.bam
First value: [0.11535063 0.17579393 0.58409214 ... 0.         0.         0.        ]


---
This is essentially the depths (RPKM) table, here in the form of a dict.

We need to convert this to a Numpy array to feed it to the VAE.

---

In [62]:
help(vamb.parsebam.array_from_sample_rpkms)

Help on function array_from_sample_rpkms in module vamb.parsebam:

array_from_sample_rpkms(sample_rpkms, columns)
    Creates a (n-contigs x n-bamfiles) array from a sample_rpkms
    (expected to be a {path: Numpy-32-float-RPKM} dictionary)
    
    Inputs: 
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        columns: Names of BAM file paths in sample_rpkms object in correct order
        
    Output: A (n-contigs x n-bamfiles) array with each column being the corre-
    sponding array from sample_rpkms, normalized so each row sums to 1



In [64]:
rpkms = vamb.parsebam.array_from_sample_rpkms(sample_rpkms, bamfiles)

# Now we're keeping the depths (RPKM) twice in memory - we could delete
# the dict if we wanted, but again, this dataset is so small it hardly matters.

# The RAM consumption of either is approximately 4 * n_samples * n_contigs bytes
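
As a worked example of the estimate above, here is the arithmetic for this dataset (6 samples, 39551 contigs, 4 bytes per 32-bit float). The second call uses made-up numbers for a large project, purely for scale; `rpkm_bytes` is a hypothetical helper.

```python
def rpkm_bytes(n_samples, n_contigs, bytes_per_value=4):
    """Approximate size in bytes of a 32-bit float depths table."""
    return bytes_per_value * n_samples * n_contigs


# This tutorial's dataset.
small = rpkm_bytes(n_samples=6, n_contigs=39551)
print(f'{small} bytes = {small / 1e6:.2f} MB')  # 949224 bytes = 0.95 MB

# A hypothetical large dataset: 500 samples and 5 million contigs.
large = rpkm_bytes(n_samples=500, n_contigs=5_000_000)
print(f'{large} bytes = {large / 1e9:.0f} GB')  # 10000000000 bytes = 10 GB
```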

## Now we need to train the autoencoder

# ADD THIS PART WHEN I'VE ADDED THE VAE TO THE API

Here I just load the latent representation from disk. In a real run, it would already be in memory.

In [70]:
import pandas as pd
import numpy as np

In [71]:
latent = pd.read_csv('/home/jakni/Downloads/binningexample/latent.tsv', delimiter='\t', dtype=np.float32, header=None)

In [80]:
latent = latent.values

## Determining the clustering threshold

# Also add this part when it's stable

In [87]:
threshold = 0.03

## Clustering the latent representation

There are two clustering algorithms: `vamb.cluster.cluster`, which is accurate but scales badly with large datasets (up to one or two million contigs is alright), and `vamb.cluster.tandemcluster`, which is less accurate but scales better.

The heavy lifting here is done in Numpy, so it might be worth making sure the BLAS library your Numpy is using is fast.

I have 40k contigs, so I'm obviously going for the slow but accurate one.

In [75]:
help(vamb.cluster.cluster)

Help on function cluster in module vamb.cluster:

cluster(matrix, labels, inner, outer=None, max_steps=15, spearman=False)
    Iterative medoid cluster generator. Yields (medoid), set(labels) pairs.
    
    Inputs:
        matrix: A (obs x features) Numpy matrix of values
        labels: Numpy array with labels for matrix rows. None or 1-D array
        inner: Optimal medoid search within this distance from medoid
        outer: Radius of clusters extracted from medoid. If None, same as inner
        max_steps: Stop searching for optimal medoid after N futile attempts
        spearman: Use Spearman, not Pearson correlation
    
    Output: Generator of (medoid, set(labels_in_cluster)) tuples.



In [88]:
labels = np.array(contignames)
cluster_iterator = vamb.cluster.cluster(latent, labels, threshold)

clusters = dict()
for medoid, contigs in cluster_iterator:
    clusters[medoid] = contigs
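
Because the clustering function is described as a generator of `(medoid, set(labels))` pairs, the loop above is equivalent to passing the generator straight to `dict()`. The sketch below demonstrates this with a stand-in generator; `fake_cluster_generator` and its contig names are invented for the example and have nothing to do with Vamb's actual output.

```python
def fake_cluster_generator():
    """Stand-in for a cluster generator: yields (medoid, set_of_labels) pairs."""
    yield 'contig_a', {'contig_a', 'contig_b'}
    yield 'contig_c', {'contig_c'}


# Explicit loop, as in the cell above.
clusters_loop = dict()
for medoid, contigs in fake_cluster_generator():
    clusters_loop[medoid] = contigs

# Same result in one step.
clusters_direct = dict(fake_cluster_generator())
assert clusters_loop == clusters_direct

# A generator can only be consumed once: a second pass yields nothing.
generator = fake_cluster_generator()
first_pass = dict(generator)
second_pass = dict(generator)
assert len(first_pass) == 2 and second_pass == {}
```

The last part is worth keeping in mind: once the clusters have been collected, the generator is exhausted, so keep the dict around rather than the generator.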

## Postprocessing the clusters

This is not automatic, because it really depends on what you're looking for.

One of the greatest weaknesses of Vamb is that the bins tend to be highly fragmented. You'll have lots of tiny bins, some of which are legitimate (viruses, plasmids), but most are parts of larger genomes that didn't get binned properly.

Here, let's say we're only interested in bacteria, so we throw away all bins with less than 250,000 base pairs.

In [101]:
# First let's make a {contigname: length} dict
lengthof = {name:length for name, length in zip(contignames, lengths)}

# Now filter away the small bins
filtered_bins = dict()

for medoid, contigs in clusters.items():
    binsize = sum(lengthof[contig] for contig in contigs)
    
    if binsize >= 250000:
        filtered_bins[medoid] = contigs

In [103]:
print('Number of bins before filtering:', len(clusters))
print('Number of bins after filtering:', len(filtered_bins))

Number of bins before filtering: 1520
Number of bins after filtering: 86


## (If you have a reference: Benchmark Vamb)

For this to make any sense, you need to have a *reference*, that is, a list of bins that are deemed true and complete.

The reference could be a {clustername: set(contigs)} dict along with a {contigname: length} dict, just like the `clusters` and `lengthof` we made. It could also be a tab-separated file with clustername, contigname, length rows, one row per contig.

Now, I have no reference for this dataset, so I created a reference file completely randomly:

In [122]:
!head /home/jakni/Downloads/example/reference.tsv

# binname contigname length
0	s198_NODE_2960_length_5085_cov_5.30505	5085
0	s30_NODE_9489_length_2530_cov_2.23365	2530
0	s179_NODE_2638_length_5642_cov_2.52661	5642
0	s30_NODE_160_length_42890_cov_12.914	42890
0	s198_NODE_4819_length_3620_cov_4.35672	3620
0	s178_NODE_2065_length_4779_cov_4.00513	4779
0	s198_NODE_1167_length_8851_cov_6.04376	8851
0	s198_NODE_7205_length_2698_cov_5.10081	2698
0	s198_NODE_5233_length_3401_cov_5.00303	3401
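
For illustration, a file in this layout can also be parsed by hand into the two dicts described above. This is a sketch under stated assumptions: `parse_reference` is a hypothetical helper, not the Vamb loader, and it assumes tab-separated columns with header or comment lines starting with `#`.

```python
def parse_reference(lines):
    """Parse 'binname<TAB>contigname<TAB>length' rows.

    Returns ({binname: set(contignames)}, {contigname: length}).
    Empty lines and lines starting with '#' are skipped.
    """
    bins = dict()
    lengthof = dict()
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue
        binname, contigname, length = line.split('\t')
        bins.setdefault(binname, set()).add(contigname)
        lengthof[contigname] = int(length)
    return bins, lengthof


example = [
    '# binname contigname length\n',
    '0\ts198_NODE_2960_length_5085_cov_5.30505\t5085\n',
    '0\ts30_NODE_9489_length_2530_cov_2.23365\t2530\n',
]
bins, lengthof = parse_reference(example)
print(len(bins['0']), lengthof['s30_NODE_9489_length_2530_cov_2.23365'])  # 2 2530
```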


In [123]:
# I have no reference for this dataset, so I just made a completely random one.
# Since any correct bins according to this reference would be entirely incidental,
# the benchmark will probably show zero good bins.
reference_path = '/home/jakni/Downloads/example/reference.tsv'

with open(reference_path) as filehandle:
    reference = vamb.benchmark.Reference.fromfile(filehandle)

---
We also need to instantiate the Observed bins (which we created above!) and a BenchMarkResult.

---

In [124]:
observed = vamb.benchmark.Observed(clusters, reference)
result = vamb.benchmark.BenchMarkResult(reference=reference, observed=observed)

In [125]:
# Okay, how did we do?
result.printmatrix()

	Recall
Prec.	0.3	0.4	0.5	0.6	0.7	0.8	0.9	0.95
0.7	0	0	0	0	0	0	0	0
0.8	0	0	0	0	0	0	0	0
0.9	0	0	0	0	0	0	0	0
0.95	0	0	0	0	0	0	0	0
0.99	0	0	0	0	0	0	0	0


---
As expected (because the reference was randomly generated), the results are terrible - in fact, they couldn't be worse.
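
To make the two axes of the matrix concrete, here is one common way to define precision and recall for a single observed bin against a single reference bin, weighting each contig by its length. This is an assumption made for illustration and not necessarily how `vamb.benchmark` computes its numbers; `precision_recall` and the toy contigs are hypothetical.

```python
def precision_recall(observed, reference, lengthof):
    """Length-weighted precision and recall of one observed bin vs one reference bin.

    observed, reference: sets of contig names
    lengthof: {contigname: length}
    """
    shared = sum(lengthof[contig] for contig in observed & reference)
    observed_size = sum(lengthof[contig] for contig in observed)
    reference_size = sum(lengthof[contig] for contig in reference)

    # Precision: how much of the observed bin belongs to the reference bin.
    precision = shared / observed_size if observed_size else 0.0
    # Recall: how much of the reference bin was recovered.
    recall = shared / reference_size if reference_size else 0.0
    return precision, recall


toy_lengths = {'a': 1000, 'b': 3000, 'c': 1000}
print(precision_recall({'a', 'b'}, {'b', 'c'}, toy_lengths))  # (0.75, 0.75)
```

Under this definition, a cell in the matrix would count the bins that reach at least the given precision and the given recall at the same time.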

---

## Now what?

Here are some ideas:

__Quality control your bins with CheckM__

CheckM tries to assess the contamination and completeness of the given bins. It also tries to classify them, with more limited success.

__Improve the bins with Stranglerfig or RefineM__

These tools inspect your bins and refine them by reassigning contigs between bins.