## Step three: Train the autoencoder and encode input data

Again, you can use `help` to see how to use the module:

`>>> help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call `help` on those too, but I'm not showing that here.

Training a network always takes some time. If you have a GPU and CUDA installed, you can pass `cuda=True` to train on your GPU for increased speed. I run this on my laptop, so I'll just use my CPU:

In [None]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, verbose=True)

---
The VAE encodes the high-dimensional (n_samples + 136 features) input data in a lower-dimensional space (nlatent features). When training, it both learns the encoding scheme and attempts to reconstruct the input data from the latent representation after Gaussian noise has been added to it.

The theory here is that the latent representation should be a more efficient encoding of the input data. If the input data for the contigs indeed do fall into bins, an efficient encoding should be to simply encode the bin they belong to, then use the "bin identity" to reconstruct the data. We add noise to prevent it from learning a huge number of slightly different bins. In the most extreme case, each bin would contain only one contig.
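To make the "latent representation plus Gaussian noise" idea concrete, here is a minimal sketch in plain Python. It is not vamb's code: the function `sample_latent`, the variable names, and the numbers are all made up for illustration, and I'm assuming the usual VAE setup where the encoder outputs a mean and a log standard deviation per latent neuron.

```python
import math
import random


def sample_latent(mu, logsigma, rng):
    """Return mu + sigma * eps for each latent neuron, with eps ~ N(0, 1).

    Hypothetical helper for illustration only, not part of vamb.
    """
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, logsigma)]


rng = random.Random(0)          # seeded so the example is reproducible

# Hypothetical encoder output for a single contig with 3 latent neurons
mu = [0.5, -1.2, 2.0]           # means
logsigma = [-2.0, -2.0, -2.0]   # log standard deviations (assumed parametrisation)

noisy = sample_latent(mu, logsigma, rng)  # what the decoder sees during training
clean = list(mu)                          # noise-free value, as used when encoding after training

print("noisy:", noisy)
print("clean:", clean)
```

Because the decoder only ever sees the noisy values during training, two contigs whose means are closer together than the noise level look the same to it, which is what discourages a huge number of tiny bins.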

The loss of the VAE is the sum of three measures:

* Binary cross entropy (BCE) measures the dissimilarity of the reconstructed abundances to observed abundances
* Mean squared error (MSE) measures the dissimilarity of reconstructed versus observed TNF
* Kullback-Leibler divergence (KLD) measures the dissimilarity between the standard normal distribution and the distribution of encoded values with noise added

At least in principle, the latter term induces the VAE not to overfit wildly, by imposing a sensible prior on the kind of encodings it can choose.
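The three terms can be sketched in plain Python as follows. This uses the textbook formulas for BCE, MSE and the KL divergence between N(mu, sigma^2) and the standard normal; I'm not claiming this is exactly how vamb computes or normalises them. The function names, the example numbers and the three weights are hypothetical.

```python
import math


def binary_cross_entropy(observed, reconstructed, eps=1e-9):
    """Mean BCE between observed and reconstructed abundances (values in [0, 1])."""
    total = 0.0
    for o, r in zip(observed, reconstructed):
        r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= o * math.log(r) + (1.0 - o) * math.log(1.0 - r)
    return total / len(observed)


def mean_squared_error(observed, reconstructed):
    """Mean squared difference between observed and reconstructed TNF."""
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed)) / len(observed)


def kl_divergence(mu, logsigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent neurons."""
    return sum(0.5 * (math.exp(2.0 * ls) + m * m - 1.0) - ls
               for m, ls in zip(mu, logsigma))


def vae_loss(depths, depths_out, tnf, tnf_out, mu, logsigma,
             bce_weight, mse_weight, kld_weight):
    """Weighted sum of the three terms. The weights are hypothetical knobs."""
    bce = binary_cross_entropy(depths, depths_out)
    mse = mean_squared_error(tnf, tnf_out)
    kld = kl_divergence(mu, logsigma)
    return bce_weight * bce + mse_weight * mse + kld_weight * kld


# Made-up numbers for one contig: 3 samples, 4 TNF features, 2 latent neurons
total = vae_loss(
    depths=[0.7, 0.2, 0.1], depths_out=[0.6, 0.3, 0.1],
    tnf=[0.1, -0.3, 0.0, 0.2], tnf_out=[0.0, -0.2, 0.1, 0.2],
    mu=[0.5, -1.0], logsigma=[-1.0, -0.5],
    bce_weight=1.0, mse_weight=0.01, kld_weight=0.001,  # hypothetical weights
)
print("total loss:", total)
```

Note that the KLD term is exactly zero when every `mu` is 0 and every `logsigma` is 0, that is, when the noisy encoding already follows the standard normal.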

We can see that the mean squared error (the TNF-related loss) is rising during these first 5 epochs, presumably as the VAE sacrifices an efficient representation of the TNF in order to learn the depths (whose loss is BCE) better. This happens sometimes, and it's alright. After all, co-abundance usually contains more information than TNF, and so we have scaled the BCE to be several orders of magnitude higher than the MSE in order for the VAE to be able to make this choice.

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [None]:
latent = vae.encode(dataloader)

print(latent.shape)

---
That's 39551 contigs each represented by the (non-noisy) value of 40 latent neurons.

Now we need to cluster this. But first, we must determine a proper clustering threshold.

---