# Walkthrough of Vamb from the Python interpreter

The Vamb pipeline consists of a handful of tasks, each of which has a dedicated module:

---
1) Parse fasta file and get TNF of each sequence, as well as sequence length and names

2) Parse the BAM files and get abundance estimate for each sequence in the fasta file

3) Train a VAE with the depths and TNF matrices

4) Encode the depths and TNF matrices using the VAE

5) Cluster the encoded inputs to metagenomic bins

---
In the following chapters of this walkthrough, we will go through each step in more detail from within the Python interpreter. We will explain what each step does, some of the theory behind the actions, and the different parameters that can be set. With this knowledge, you should be able to extend Vamb relatively easily.

For the examples, we will assume the two relevant prerequisite files exist in the directory `/home/jakni/Downloads/example`:

* `contigs.fna` - The filtered FASTA contigs which were mapped against, and
* `bamfiles/*.bam` - The 6 .bam files from mapping the reads to the contigs above.

---

## Step zero: Importing Vamb and getting help

The first step is to get Vamb imported:
    
    [jakni@nissen:~]$ python
    >>> import vamb
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'vamb'
    >>> # We're not in the directory containing the vamb directory.
    >>> # That means the directory containing the vamb dir is not in our sys.path.
    >>> # Either move the vamb directory to one of your sys.path dirs 
    >>> # or add the vamb directory to sys.path. We'll do the latter.
    >>> import sys
    >>> sys.path.append('/home/jakni/Documents/scripts/')
    >>> import vamb
    >>> # No error message - success!

You'll almost certainly need help when using Vamb (we wish it were so easy you didn't, but making user-friendly software is *hard!*).

Luckily, there's the built-in `help` function in Python.

---

`>>> help(vamb)`
    
    Help on package vamb:

    NAME
        vamb - Variational Autoencoder for Metagenomic Binning

    DESCRIPTION
        Vamb does what it says on the tin - bins metagenomes using a variational autoencoder.
        
    [ lines elided ]
    
        General workflow:
        1) Filter contigs by size using vamb.filtercontigs
        2) Map reads to contigs to obtain BAM file
        3) Calculate TNF of contigs using vamb.parsecontigs
        4) Create RPKM table using vamb.parsebam
        5) Train autoencoder using vamb.encode
        6) Cluster latent representation using vamb.cluster
    
    [ lines elided ]
    
---
    
The `PACKAGE CONTENTS` under `help(vamb)` is just a list of all importable files in the `vamb` directory - some of these really shouldn't be imported, so ignore that.

---
You can also get help for the modules:

`>>> help(vamb.cluster)`

    Help on module vamb.cluster in vamb:

    NAME
        vamb.cluster - Iterative medoid clustering of Numpy arrays.

    DESCRIPTION
        Implements two core functions: cluster and tandemcluster, along with the helper
        functions writeclusters and readclusters.
        For all functions in this module, a collection of clusters are represented as
        a {clustername, set(elements)} dict.

        Clustering algorithm:
    
    [ lines elided ]
        
---
And for functions:

`>>> help(vamb.cluster.tandemcluster)`

    Help on function tandemcluster in module vamb.cluster:

    tandemcluster(matrix, labels, inner, outer=None, max_steps=15, normalized=False)
        Splits the datasets, then clusters each partition before merging
        the resulting clusters. This is faster, especially on larger datasets, but
        less accurate than normal clustering.

        Inputs:
            matrix: A (obs x features) Numpy matrix of values
            labels: Numpy array with labels for matrix rows. None or 1-D array
            inner: Optimal medoid search within this distance from medoid
            outer: Radius of clusters extracted from medoid. If None, same as inner
            max_steps: Stop searching for optimal medoid after N futile attempts
            normalized: Matrix is already zscore-normalized [False]

        Output: {(partition, medoid): set(labels_in_cluster) dictionary}

In [1]:
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

## Step one: Parse the FASTA file

If you forget what to do at each step, remember that `help(vamb)` said:

    General workflow:
    1) Filter contigs by size using vamb.filtercontigs
    2) Map reads to contigs to obtain BAM file
    3) Calculate TNF of contigs using vamb.parsecontigs
    
    [ lines elided ]

Okay, we already have filtered contigs. I could have used the `filtercontigs.py` script in the Vamb directory (only usable from command line) to filter the contigs, but here, they were already filtered. We have already mapped reads to them and gotten BAM files, so we begin with the `vamb.parsecontigs` module. How do you use that?

In [3]:
help(vamb.parsecontigs)

Help on module vamb.parsecontigs in vamb:

NAME
    vamb.parsecontigs - Calculate z-normalized tetranucleotide frequency from a FASTA file.

DESCRIPTION
    Usage:
    >>> with open('/path/to/contigs.fna', 'rb') as filehandle
    ...     tnfs, contignames, lengths = read_contigs(filehandle)

FUNCTIONS
    read_contigs(byte_iterator, minlength=100)
        Parses a FASTA file open in binary reading mode.
        
        Input:
            byte_iterator: Iterator of binary lines of a FASTA file
            minlength[100]: Ignore any references shorter than N bases 
        
        Outputs:
            tnfs: A (n_FASTA_entries x 136) matrix of tetranucleotide freq.
            contignames: A list of contig headers
            lengths: A Numpy array of contig lengths

DATA
    TNF_HEADER = '#contigheader\tAAAA/TTTT\tAAAC/GTTT\tAAAG/CTTT\tAAAT...A...
    __cmd_doc__ = 'Calculate z-normalized tetranucleotide frequency...eoti...

FILE
    /home/jakni/Documents/scripts/vamb/parsecontigs.py



---
I use `vamb.parsecontigs.read_contigs` with the inputs and outputs as written:

---

In [2]:
# Open the file in binary mode - you can use the vamb.vambtools.Reader to read
# from normal or gzipped file seamlessly. Here I just use the open function
with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as filehandle:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(filehandle)

# Let's have a look at the resulting data

print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 10 elements of contignames:')
for i in range(10):
    print(contignames[i])
    
print('\n')
    
print('Type of lengths:', type(lengths), 'of dtype', lengths.dtype)
print('Length of lengths:', len(lengths), end='\n\n')

print('First 10 elements of lengths:')
for i in range(10):
    print(lengths[i])

Type of tnfs: <class 'numpy.ndarray'> of dtype float32
Shape of tnfs: (39551, 136)

Type of contignames: <class 'list'>
Length of contignames: 39551

First 10 elements of contignames:
s30_NODE_1_length_245508_cov_18.4904
s30_NODE_2_length_222690_cov_39.7685
s30_NODE_3_length_222459_cov_20.3665
s30_NODE_4_length_173155_cov_20.1181
s30_NODE_5_length_161239_cov_20.1237
s30_NODE_6_length_157102_cov_20.734
s30_NODE_7_length_156768_cov_44.8078
s30_NODE_8_length_152691_cov_19.6759
s30_NODE_9_length_121154_cov_21.6491
s30_NODE_10_length_119726_cov_136.834


Type of lengths: <class 'numpy.ndarray'> of dtype int64
Length of lengths: 39551

First 10 elements of lengths:
245508
222690
222459
173155
161239
157102
156768
152691
121154
119726


---
It turns out that related organisms tend to share a similar k-mer distribution across most of their genome. The reason for that is not understood, even though it's believed that common functional motifs, GC-content and presence/absence of endonucleases explain some of the observed similarity.

The `tnfs` is the tetranucleotide frequency - it's the frequency of the canonical kmer of each 4mer in the contig. The matrix is z-score normalized across contigs for each sample such that the frequency of e.g. 'AGGC' is measured relative to other contigs in that sample - this increases the signal-to-noise ratio.
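
To make the idea concrete, here is a minimal sketch of how a TNF matrix could be computed in pure Python and Numpy. This is not Vamb's implementation. The names `canonical`, `tnf` and `zscore_columns` are made up for this illustration, and I assume that a 4-mer and its reverse complement are counted together and that each 4-mer column is z-scored across contigs.

```python
from itertools import product

import numpy as np

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, revcomp)

# The canonical 4-mers in sorted order, each mapped to a column index
CANONICAL_4MERS = sorted({canonical("".join(p)) for p in product("ACGT", repeat=4)})
INDEX = {kmer: i for i, kmer in enumerate(CANONICAL_4MERS)}

def tnf(sequence):
    """Frequency of each canonical 4-mer in one sequence."""
    counts = np.zeros(len(CANONICAL_4MERS))
    sequence = sequence.upper()
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip 4-mers with N or other ambiguous bases
            counts[INDEX[canonical(kmer)]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

def zscore_columns(matrix):
    """Z-score normalize each column (one 4-mer) across rows (contigs)."""
    std = matrix.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (matrix - matrix.mean(axis=0)) / std

# Toy sequences, far shorter than any real contig
toy_contigs = ["ACGTACGTTTGACCA", "GGGCCCATATATGCA", "TTTTAAAACCCCGGGG"]
toy_tnfs = zscore_columns(np.array([tnf(s) for s in toy_contigs]))
print(toy_tnfs.shape)
```

The result has one row per sequence and one column per canonical 4-mer, which is the same layout as the `tnfs` matrix above.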

We use 4-mers because there are 136 canonical 4-mers, which is an appropriate number of features to cluster - not so few that there's no signal and not so many that it becomes unwieldy and the estimates of the frequencies become uncertain. We could also have used 3-mers or 5-mers. In tests we have made, 3-mers are _almost_, but not quite, as good as 4-mers for separating different species. There are 512 canonical 5-mers, which would be too many features to handle comfortably and could easily cause memory issues. You could probably switch tetranucleotide frequency to trinucleotide frequency in Vamb without any significant drop in accuracy.
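
The feature counts quoted above can be checked by brute force. The sketch below enumerates every k-mer, collapses each one with its reverse complement and counts what is left. The function name `n_canonical_kmers` is just for this illustration.

```python
from itertools import product

def n_canonical_kmers(k):
    """Count k-mers after merging each k-mer with its reverse complement."""
    complement = str.maketrans("ACGT", "TGCA")
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        revcomp = kmer.translate(complement)[::-1]
        seen.add(min(kmer, revcomp))
    return len(seen)

for k in (3, 4, 5):
    print(k, n_canonical_kmers(k))
# 3 32
# 4 136
# 5 512
```

For odd k no k-mer equals its own reverse complement, so the count is simply 4^k / 2. For k = 4 there are 16 palindromic 4-mers, giving (256 + 16) / 2 = 136.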

At this point, you should probably consider whether you can keep everything in memory. If not, all the relevant modules have reading and writing functions so you can dump the results to disk and delete them from memory. This is a small dataset, so there's no problem. With hundreds of samples and millions of contigs, however, this becomes a problem, even though Vamb is fairly memory-friendly.

As a rule of thumb, the memory consumption for the most memory intensive step is approximately 8 × (n_samples + 136) × n_contigs bytes plus a little bit of overhead. If this is much lower than your RAM, don't worry about it. If it's within a factor of 2 of your available RAM, you'll need to delete objects you don't need anymore.

In my example, I have 6 samples and 39551 contigs for a total memory usage of ~45 MB.
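
The rule of thumb is easy to wrap in a small helper so you can plug in your own numbers before loading anything. The function name is made up for this walkthrough, and the result is only the rough estimate from the formula above, not a measurement.

```python
def estimate_peak_memory_bytes(n_samples, n_contigs, n_tnf_features=136, bytes_per_value=8):
    """Rule-of-thumb estimate: 8 x (n_samples + 136) x n_contigs bytes."""
    return bytes_per_value * (n_samples + n_tnf_features) * n_contigs

estimate = estimate_peak_memory_bytes(n_samples=6, n_contigs=39551)
print(f"{estimate / 1e6:.1f} MB")
# 44.9 MB
```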

---

## Step two: Parse the BAM files

In [4]:
# Again, we can use the help function to see what we need to do
help(vamb.parsebam.read_bamfiles)

Help on function read_bamfiles in module vamb.parsebam:

read_bamfiles(paths, minscore=50, minlength=100, processes=4)
    Spawns processes to parse BAM files and get contig rpkms.
        
    Input:
        path: Path to BAM file
        minscore [50]: Minimum alignment score (AS field) to consider
        minlength [100]: Ignore any references shorter than N bases 
        processes [4]: Number of processes to spawn
    
    Outputs:
        sample_rpkms: A {path: Numpy-32-float-RPKM} dictionary
        contignames: A list of contignames from first BAM header



---
We can see (in the default value for the `processes` argument) that the default number of parallel BAM-reading processes it will spawn is 4. This is because Python detected 4 threads on my laptop. In general, Vamb's default here is to use the number of available threads, or 8 threads if more than 8 are detected. Beyond 8 processes, the BAM reading will almost certainly become IO bound.

As with the `vamb.parsecontigs.read_contigs` function, I don't care about the `minlength` argument, since our FASTA file is already filtered. Again, filtering the FASTA file _before_ mapping leads to the best results.
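
If you want to know what that default would be on your own machine, the rule described above can be sketched with the standard library. How Vamb actually detects the thread count is an assumption on my part. Here I simply use `os.cpu_count`.

```python
import os

# os.cpu_count() can return None if the count cannot be determined
n_threads = os.cpu_count() or 1

# Use all detected threads, but never more than 8
default_processes = min(n_threads, 8)
print(default_processes)
```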

Lastly, the function ignores all alignments with alignment score less than 50 (as determined by the optional `AS:i` field in the BAM file). That seems reasonable here.

---

In [4]:
bamfiles = !ls /home/jakni/Downloads/example/bamfiles
bamfiles = ['/home/jakni/Downloads/example/bamfiles/' + p for p in bamfiles]
bamfiles

['/home/jakni/Downloads/example/bamfiles/e101.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e178.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e179.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e196.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e198.filtered.bam',
 '/home/jakni/Downloads/example/bamfiles/e30.filtered.bam']
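
---
The `!ls` syntax above only works in IPython and Jupyter. In a plain Python interpreter, a sketch along these lines should give the same list, assuming every BAM file in the directory ends in `.bam`. The helper name `list_bamfiles` is made up for this example.

```python
import glob
import os

def list_bamfiles(directory):
    """Return the sorted full paths of all .bam files in a directory."""
    pattern = os.path.join(directory, '*.bam')
    return sorted(glob.glob(pattern))

# Returns an empty list if the directory does not exist or holds no BAM files
bamfiles = list_bamfiles('/home/jakni/Downloads/example/bamfiles')
```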

In [5]:
# That looks right.

# We already have the contignames, so who cares about saving that
sample_rpkms, _ = vamb.parsebam.read_bamfiles(bamfiles)
del _

print('Type of sample_rpkms:', type(sample_rpkms))

print('Content of the dict:')
print('First key:', next(iter(sample_rpkms.keys())))
print('First value:', next(iter(sample_rpkms.values())))

Type of sample_rpkms: <class 'dict'>
Content of the dict:
First key: /home/jakni/Downloads/example/bamfiles/e101.filtered.bam
First value: [0.11535063 0.17579393 0.58409214 ... 0.         0.         0.        ]


---
The idea here is that two contigs from the same genome will always be physically present together, and so they should have a similar abundance across all samples. Some contigs represent repeats like duplicated segments - these contigs should have a fixed ratio of abundance to other contigs. Thus, even when considering repeated contigs, there should be a tight Pearson correlation between abundances of contigs from the same genome.
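
A toy example shows why a fixed ratio does not hurt the correlation. The abundances below are invented for illustration and do not come from the example dataset.

```python
import numpy as np

# Hypothetical abundances across 6 samples
single_copy = np.array([12.0, 3.5, 0.8, 20.1, 7.7, 1.2])
repeat = 3 * single_copy  # a repeat present in three copies in the same genome
unrelated = np.array([1.0, 9.0, 4.0, 0.5, 2.0, 15.0])  # contig from another genome

r_same = np.corrcoef(single_copy, repeat)[0, 1]
r_other = np.corrcoef(single_copy, unrelated)[0, 1]
print(f"same genome: {r_same:.3f}, different genome: {r_other:.3f}")
```

The repeat has three times the abundance of the single-copy contig in every sample, yet the Pearson correlation between the two is still 1, because Pearson correlation is unaffected by scaling.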

The `vamb.parsebam` module takes a rather crude approach to estimating abundance, namely by simply counting the number of reads mapped to each contig, divided by the total number of reads and the contig's length. In transcriptomics, this measure is often called RPKM, *reads per kilobase per million mapped reads*. Other metagenomic binners like Metabat and Canopy use an average of per-nucleotide depth of coverage instead. We do not believe there is any theoretical or practical advantage of using depth over RPKM. Because BWA handles redundant databases rather poorly, there is not even any advantage of using FPKM over RPKM. We will use the terms *depth* and *rpkm* interchangeably.

The object we just created, `sample_rpkms`, is the depth table in the form of a dictionary with one column (representing one BAM file) per entry. Just like the TNF, this needs to be converted to a (n_contigs x n_features) Numpy array in order for it to be used in the variational autoencoder.

We can use the function `vamb.parsebam.toarray` to do this. This will create a new array in memory while retaining `sample_rpkms`. Each object requires approximately 4 × n_contigs × n_samples bytes. If having two of these objects in memory will consume all your RAM, you can dump the `sample_rpkms` to disk, delete it, and reload it in a matrix using the functions `write_rpkms` and `array_fromnpz` in the `vamb.parsebam` module.
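
For reference, the RPKM calculation itself can be sketched in a few lines of Numpy. This is not the code in `vamb.parsebam`. The function below is an illustration that starts from per-contig read counts for a single sample, and I assume that the total is the number of reads mapped to the listed contigs.

```python
import numpy as np

def rpkm(read_counts, contig_lengths):
    """Reads per kilobase of contig per million mapped reads, for one sample."""
    read_counts = np.asarray(read_counts, dtype=np.float64)
    contig_lengths = np.asarray(contig_lengths, dtype=np.float64)
    total_reads = read_counts.sum()
    if total_reads == 0:
        return np.zeros(len(read_counts), dtype=np.float32)
    kilobases = contig_lengths / 1_000
    millions = total_reads / 1_000_000
    return (read_counts / (kilobases * millions)).astype(np.float32)

# Toy numbers: 2000 mapped reads spread over three contigs
toy_counts = [500, 1500, 0]
toy_lengths = [2000, 3000, 1000]
toy_rpkm = rpkm(toy_counts, toy_lengths)
print(toy_rpkm)
```

With these toy numbers, the first contig gets 500 / (2 kb × 0.002 million reads) = 125000 and the second gets 1500 / (3 × 0.002) = 250000.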
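
Conceptually, the conversion amounts to stacking the dictionary's values side by side in a fixed order. The sketch below shows that idea with plain Numpy. It is an illustration under that assumption, not the actual body of `vamb.parsebam.toarray`, and the name `dict_to_matrix` is made up.

```python
import numpy as np

def dict_to_matrix(sample_rpkms, paths):
    """Stack one RPKM vector per path as columns of a (n_contigs x n_samples) array."""
    columns = [sample_rpkms[path] for path in paths]
    return np.column_stack(columns).astype(np.float32, copy=False)

# Toy dictionary with two "samples" and three contigs
toy = {
    'a.bam': np.array([0.1, 0.2, 0.0], dtype=np.float32),
    'b.bam': np.array([1.0, 0.5, 0.3], dtype=np.float32),
}
matrix = dict_to_matrix(toy, ['a.bam', 'b.bam'])
print(matrix.shape, matrix.nbytes)
# (3, 2) 24
```

The 24 bytes match the 4 × n_contigs × n_samples estimate above. Passing the list of paths explicitly fixes the column order, so each column can be traced back to its BAM file.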

---

In [6]:
rpkms = vamb.parsebam.toarray(sample_rpkms, bamfiles)

---
Now, I tend to be a bit ~~paranoid~~<sup>careful</sup>, so if I loaded in 500 GB of BAM files, I'd want to save the work I have now in case something goes wrong - and we're about to fire up the VAE so lots of things can go wrong.

Which important objects do I have in memory right now?

* `contignames`: A list of contignames
* `lengths`: A Numpy array of contig lengths
* `rpkms`: A Numpy array of rpkms
* `tnfs`: A Numpy array of tnfs

I'm going to use `pickle` to save the Python list and `vamb.vambtools.write_npz` to save the Numpy arrays (the latter is just a wrapper for `numpy.savez_compressed`). Of course, I could have used `pickle` for it all.

---

In [7]:
import pickle

with open('/home/jakni/Downloads/example/contignames.pickle', 'wb') as file:
    pickle.dump(contignames, file, protocol=4)

with open('/home/jakni/Downloads/example/lengths.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, lengths)

with open('/home/jakni/Downloads/example/tnfs.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, tnfs)
    
with open('/home/jakni/Downloads/example/rpkms.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, rpkms)
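
---
Saving is only half of a checkpoint, so here is a sketch of a round trip using `pickle` and plain Numpy. Since `write_npz` is described above as a wrapper for `numpy.savez_compressed`, I call the Numpy function directly and assume the array is stored under Numpy's default key `arr_0`. If the wrapper uses a different key, adjust accordingly. The toy data and the temporary directory are only there to make the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np

toy_names = ['contig_1', 'contig_2']
toy_lengths = np.array([2500, 2100])

with tempfile.TemporaryDirectory() as directory:
    names_path = os.path.join(directory, 'contignames.pickle')
    lengths_path = os.path.join(directory, 'lengths.npz')

    # Save, mirroring the cell above
    with open(names_path, 'wb') as file:
        pickle.dump(toy_names, file, protocol=4)
    with open(lengths_path, 'wb') as file:
        np.savez_compressed(file, toy_lengths)

    # Load the list back with pickle
    with open(names_path, 'rb') as file:
        loaded_names = pickle.load(file)

    # An unnamed array passed to savez_compressed is stored as 'arr_0'
    with np.load(lengths_path) as npz:
        loaded_lengths = npz['arr_0']

print(loaded_names, loaded_lengths)
```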