## Step zero: Importing Vamb and getting help

The first step is to import Vamb:
    
    [jakni@nissen:~]$ python
    >>> import vamb
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ModuleNotFoundError: No module named 'vamb'
    >>> # We're not in the directory containing the vamb directory.
    >>> # That means the directory containing the vamb dir is not in our sys.path.
    >>> # Either move the vamb directory to one of your sys.path dirs
    >>> # or add the directory containing the vamb directory to sys.path. We'll do the latter.
    >>> import sys
    >>> sys.path.append('/home/jakni/Documents/scripts/')
    >>> import vamb
    >>> # No error message - success!

You'll almost certainly need help when using Vamb (we wish it were so easy that you didn't, but making user-friendly software is *hard!*).

Luckily, there's the built-in `help` function in Python.

---

`>>> help(vamb)`
    
    Help on package vamb:

    NAME
        vamb - Variational Autoencoder for Metagenomic Binning

    DESCRIPTION
        Vamb does what it says on the tin - bins metagenomes using a variational autoencoder.
        
    [ lines elided ]
    
        General workflow:
        1) Filter contigs by size using vamb.filtercontigs
        2) Map reads to contigs to obtain BAM file
        3) Calculate TNF of contigs using vamb.parsecontigs
        4) Create RPKM table using vamb.parsebam
        5) Train autoencoder using vamb.encode
        6) Cluster latent representation using vamb.cluster
    
    [ lines elided ]
    
---
    
The `PACKAGE CONTENTS` under `help(vamb)` is just a list of all importable files in the `vamb` directory - some of these really shouldn't be imported, so ignore that.

---
You can also get help for the modules:

`>>> help(vamb.cluster)`

    Help on module vamb.cluster in vamb:

    NAME
        vamb.cluster - Iterative medoid clustering of Numpy arrays.

    DESCRIPTION
        Implements two core functions: cluster and tandemcluster, along with the helper
        functions writeclusters and readclusters.
        For all functions in this module, a collection of clusters are represented as
        a {clustername, set(elements)} dict.

        Clustering algorithm:
    
    [ lines elided ]
        
---
And for functions:

`>>> help(vamb.cluster.tandemcluster)`

    Help on function tandemcluster in module vamb.cluster:

    tandemcluster(matrix, labels, inner, outer=None, max_steps=15, normalized=False)
        Splits the datasets, then clusters each partition before merging
        the resulting clusters. This is faster, especially on larger datasets, but
        less accurate than normal clustering.

        Inputs:
            matrix: A (obs x features) Numpy matrix of values
            labels: Numpy array with labels for matrix rows. None or 1-D array
            inner: Optimal medoid search within this distance from medoid
            outer: Radius of clusters extracted from medoid. If None, same as inner
            max_steps: Stop searching for optimal medoid after N futile attempts
            normalized: Matrix is already zscore-normalized [False]

        Output: {(partition, medoid): set(labels_in_cluster) dictionary}

In [None]:
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

## Step one: Parse the FASTA file

If you forget what to do at each step, remember that `help(vamb)` said:

    General workflow:
    1) Filter contigs by size using vamb.filtercontigs
    2) Map reads to contigs to obtain BAM file
    3) Calculate TNF of contigs using vamb.parsecontigs
    
    [ lines elided ]

Okay, we already have filtered contigs, and we have mapped reads to them and gotten BAM files, so we begin with `vamb.parsecontigs`. How do you use that?

In [None]:
help(vamb.parsecontigs)

---
I use `vamb.parsecontigs.read_contigs` with the inputs and outputs as written:

---

In [None]:
# Open the file in binary mode - you can use vamb.vambtools.Reader to read
# from a normal or gzipped file seamlessly. Here I just use the open function
with open('/home/jakni/Downloads/example/contigs.fna', 'rb') as filehandle:
    tnfs, contignames, lengths = vamb.parsecontigs.read_contigs(filehandle)

# Let's have a look at the resulting data

print('Type of tnfs:', type(tnfs), 'of dtype', tnfs.dtype)
print('Shape of tnfs:', tnfs.shape, end='\n\n')

print('Type of contignames:', type(contignames))
print('Length of contignames:', len(contignames), end='\n\n')

print('First 10 elements of contignames:')
for i in range(10):
    print(contignames[i])
    
print('\n')
    
print('Type of lengths:', type(lengths), 'of dtype', lengths.dtype)
print('Length of lengths:', len(lengths), end='\n\n')

print('First 10 elements of lengths:')
for i in range(10):
    print(lengths[i])

---
It turns out that related organisms tend to share a similar k-mer distribution across most of their genome. The reason for that is not understood, although common functional motifs, GC content and the presence/absence of endonucleases are believed to explain some of the observed similarity.

`tnfs` is the tetranucleotide frequency (TNF) matrix - for each contig, it holds the frequency of each canonical 4-mer, where every 4-mer in the contig is counted as its canonical k-mer. The matrix is z-score normalized across contigs for each sample such that the frequency of e.g. 'AGGC' is measured relative to other contigs in that sample - this increases the signal-to-noise ratio.

We use 4-mers because there are 136 canonical 4-mers, which is an appropriate number of features to cluster - not so few that there's no signal and not so many it becomes unwieldy and the estimates of the frequencies become uncertain.
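
The figure of 136 can be checked by enumeration. The sketch below is not Vamb's code; it assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement, and the helper names are made up for illustration.

```python
from itertools import product

# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed convention: the smaller of the k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


all_4mers = ["".join(bases) for bases in product("ACGT", repeat=4)]
palindromes = [kmer for kmer in all_4mers if kmer == reverse_complement(kmer)]
canonical_4mers = sorted({canonical(kmer) for kmer in all_4mers})

print(len(all_4mers))        # 256 possible 4-mers
print(len(palindromes))      # 16 are their own reverse complement
print(len(canonical_4mers))  # (256 - 16) / 2 + 16 = 136
```

The 240 non-palindromic 4-mers collapse into 120 pairs, and the 16 palindromic ones stay as they are, which gives the 136 features.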

At this point, you should probably consider whether you can keep everything in memory. If not, all the relevant modules have reading and writing functions so you can dump the results to disk and delete them from memory. This is a small dataset, so there's no problem. With hundreds of samples and millions of contigs, however, this becomes a problem, even though Vamb is fairly memory-friendly.

As a rule of thumb, the memory consumption for the most memory-intensive step is approximately 8 × (n_samples + 136) × n_contigs bytes plus a little bit of overhead. If this is much lower than your RAM, don't worry about it. If it's within a factor of 2 of your available RAM, you'll need to delete objects you don't need anymore.

In my example, I have 6 samples and 39551 contigs for a total memory usage of ~45 MB.
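
The rule of thumb is easy to plug numbers into. The helper below is only an illustration of the formula from the text (the function name is made up, and the "little bit of overhead" is ignored):

```python
def estimate_peak_bytes(n_samples: int, n_contigs: int, n_tnf_features: int = 136) -> int:
    """Rule-of-thumb memory estimate: 8 x (n_samples + 136) x n_contigs bytes."""
    return 8 * (n_samples + n_tnf_features) * n_contigs


# The example dataset from the text: 6 samples and 39551 contigs.
peak = estimate_peak_bytes(n_samples=6, n_contigs=39551)
print(f"{peak} bytes = {peak / 1e6:.1f} MB")  # 44929936 bytes = 44.9 MB
```

Scaling the same formula to, say, 500 samples and 5 million contigs gives roughly 25 GB, which is where the advice about deleting unneeded objects starts to matter.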

---

## Step two: Parse the BAM files

In [None]:
# Again, we can use the help function to see what we need to do
help(vamb.parsebam.read_bamfiles)

---
We can see (in the default value for the `processes` argument) that the function detects 4 cores on my laptop. It will then spawn 4 parallel processes to read the BAM files. It's capped at 8 processes, because at that level, it almost certainly becomes I/O bound.

As with the `vamb.parsecontigs.read_contigs` function, I don't care about the `minlength` argument, since our FASTA file is already filtered.

Lastly, the function ignores all alignments with alignment score less than 50 (as determined by the optional `AS:i` field in the BAM file). That seems reasonable here.
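
How such a capped default might be derived can be sketched in two lines. This is an assumption about the logic, not a copy of Vamb's source; the names `MAX_PROCESSES` and `default_processes` are hypothetical.

```python
import os

# Cap stated in the text: more than 8 readers is expected to be I/O bound.
MAX_PROCESSES = 8

# os.cpu_count() can return None if the count is undetermined, so fall back to 1.
detected_cores = os.cpu_count() or 1
default_processes = min(detected_cores, MAX_PROCESSES)

print(default_processes)
```

On a 4-core laptop this would print 4, and on a 32-core server it would print 8.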

---

In [None]:
bamfiles = !ls /home/jakni/Downloads/example/bamfiles
bamfiles = ['/home/jakni/Downloads/example/bamfiles/' + p for p in bamfiles]
bamfiles

In [None]:
# That looks right.

# We already have the contignames, so who cares about saving that
sample_rpkms, _ = vamb.parsebam.read_bamfiles(bamfiles)
del _

print('Type of sample_rpkms:', type(sample_rpkms))

print('Content of the dict:')
print('First key:', next(iter(sample_rpkms.keys())))
print('First value:', next(iter(sample_rpkms.values())))

---
The idea here is that two contigs from the same genome will always be physically present together, and so they should have a similar abundance across all samples. Some contigs represent repeats like duplicated segments - these contigs should have a fixed ratio of abundance to other contigs. Thus, even when considering repeated contigs, there should be a tight Pearson correlation between abundances of contigs from the same genome.
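
A toy calculation shows why a fixed abundance ratio does not hurt the correlation. The abundances below are invented, and `pearson` is a plain textbook implementation written for this illustration, not something taken from Vamb.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Invented abundances of three contigs across six samples.
single_copy = [3.0, 10.0, 0.5, 7.0, 1.0, 4.0]
repeat = [2 * a for a in single_copy]          # duplicated segment, same genome
other_genome = [8.0, 1.0, 6.0, 0.5, 9.0, 2.0]  # contig from an unrelated genome

r_same = pearson(single_copy, repeat)
r_other = pearson(single_copy, other_genome)
print(r_same, r_other)
```

The repeated contig is twice as abundant in every sample, yet its correlation with the single-copy contig is 1 (up to floating-point rounding), while the contig from the other genome correlates negatively in this made-up case.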

The `vamb.parsebam` module takes a rather crude approach to estimating abundance, namely by simply counting the number of reads mapped to each contig, divided by the total number of reads and the contig's length. In transcriptomics, this measure is often called RPKM, *reads per kilobase per million mapped reads*. Other metagenomic binners like Metabat and Canopy use an average of per-nucleotide depth of coverage instead. We do not believe there is any theoretical or practical advantage of using depth over RPKM. We will use the terms *depth* and *rpkm* interchangeably.
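
As a worked example of the definition, the sketch below computes RPKM for a single contig. It follows the textbook formula only; the function name and the numbers are made up, and `vamb.parsebam` may differ in details such as which reads it counts.

```python
def rpkm(reads_on_contig: int, contig_length: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of contig per million mapped reads."""
    kilobases = contig_length / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return reads_on_contig / (kilobases * millions_of_reads)


# 500 reads map to a 2,000 bp contig in a sample with 4,000,000 mapped reads.
value = rpkm(reads_on_contig=500, contig_length=2_000, total_mapped_reads=4_000_000)
print(value)  # 500 / (2 * 4) = 62.5
```

Dividing by length keeps long contigs from looking more abundant just because they attract more reads, and dividing by the total read count makes samples of different sequencing depth comparable.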

The object we just created, `sample_rpkms`, is the depth table in the form of a dictionary with one column (representing one BAM file) per entry. Just like the TNF, this needs to be converted to a (n_contigs x n_features) Numpy array in order for it to be used in the variational autoencoder.

We can use the function `vamb.parsebam.toarray` to do this. This will create a new array in memory while retaining `sample_rpkms`. Each object requires approximately 4 x n_contigs x n_samples bytes. If having two of these objects in memory would consume all your RAM, you can dump `sample_rpkms` to disk, delete it, and reload it as a matrix using the functions `write_rpkms` and `array_fromnpz` in the `vamb.parsebam` module.
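
The reshaping itself can be pictured with plain Python lists. This is only a sketch of the idea with invented file names and values; Vamb works on Numpy arrays, and nothing here is taken from `vamb.parsebam.toarray`.

```python
# Hypothetical depth table: one column of per-contig values per BAM file.
sample_columns = {
    "sample_a.bam": [1.5, 0.0, 3.0],
    "sample_b.bam": [2.0, 0.5, 1.0],
}

# The column order is given explicitly so that it is well defined.
column_order = ["sample_a.bam", "sample_b.bam"]

# Transpose the columns into one row per contig: (n_contigs x n_samples).
rows = [list(row) for row in zip(*(sample_columns[name] for name in column_order))]
print(rows)  # [[1.5, 2.0], [0.0, 0.5], [3.0, 1.0]]

# Size estimate from the text, assuming 4 bytes per value.
n_contigs, n_samples = len(rows), len(rows[0])
approx_bytes = 4 * n_contigs * n_samples
print(approx_bytes)  # 24
```

For the example dataset with 39551 contigs and 6 samples, the same estimate comes to just under 1 MB per object.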

---

In [None]:
rpkms = vamb.parsebam.toarray(sample_rpkms, bamfiles)

---
Now, I tend to be a bit ~~paranoid~~<sup>careful</sup>, so if I loaded in 500 GB of BAM files, I'd want to save the work I have now in case something goes wrong - and we're about to fire up the VAE so lots of things can go wrong.

What important objects do I have in memory right now?

* `contignames`: A list of contig names
* `lengths`: A Numpy array of contig lengths
* `rpkms`: A Numpy array of rpkms
* `tnfs`: A Numpy array of tnfs

I'm going to use `pickle` to save the Python list and `vamb.vambtools.write_npz` to save the Numpy arrays (the latter is just a wrapper for `numpy.savez_compressed`). Of course, I could have used `pickle` for it all.

---

In [None]:
import pickle

with open('/home/jakni/Downloads/example/contignames.pickle', 'wb') as file:
    pickle.dump(contignames, file, protocol=4)

with open('/home/jakni/Downloads/example/lengths.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, lengths)

with open('/home/jakni/Downloads/example/tnfs.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, tnfs)
    
with open('/home/jakni/Downloads/example/rpkms.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, rpkms)
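
---
Saving is only useful if the data can be loaded again. As a minimal sketch of the `pickle` half, the snippet below writes a small list to a temporary directory and reads it back; the contig names are invented and the temporary path stands in for a real one. Arrays written with `numpy.savez_compressed` can be read back with `numpy.load`, which is not shown here.

```python
import os
import pickle
import tempfile

# Invented stand-in for the real list of contig names.
contignames_example = ["contig_1", "contig_2", "contig_3"]

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "contignames.pickle")

    # Save with the same protocol as in the cell above.
    with open(path, "wb") as file:
        pickle.dump(contignames_example, file, protocol=4)

    # Load it back - the file must be opened in binary mode.
    with open(path, "rb") as file:
        reloaded = pickle.load(file)

print(reloaded == contignames_example)  # True
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.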