## Step three: Train the autoencoder and encode input data

Again, you can use `help` to see how to use the module:

`>>> help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call `help` on those too, but I'm not showing that here.

Training a network always takes some time. If you have a GPU and CUDA installed, you can pass `cuda=True` to `encode.trainvae` to train on your GPU for increased speed. With a beefy GPU, this can make quite a difference. I run this on my laptop, so I'll just use my CPU.

Sometimes, you'll want to reuse a VAE you have already trained. For this, I've added the `VAE.save` method of the VAE class, as well as a `VAE.load` method. In this example, I'll write the trained model weights to a file in `/tmp` and show how to reload the VAE again. But remember - a trained VAE only works on the dataset it's been trained on!

In [1]:
# Again, we import stuff
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

# And load the data we just saved in tutorial part 1 - of course, if this was
# the same notebook, we could have just kept it in memory
with open('/home/jakni/Downloads/example/rpkms.npz', 'rb') as file:
    rpkms = vamb.vambtools.read_npz(file)
    
with open('/home/jakni/Downloads/example/tnfs.npz', 'rb') as file:
    tnfs = vamb.vambtools.read_npz(file)

In [2]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
with open('/tmp/model.pt', 'wb') as modelfile:
    vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, modelfile=modelfile, verbose=True)

	Errorsum: 1000
	MSE ratio: 0.2
	CUDA: False
	N latent: 40
	N hidden: 325, 325
	N contigs: 39551
	N samples: 6
	N epochs: 5
	Batch size: 128

	Epoch: 1	Loss: 14520.8	CE: 5.38282	MSE: 1.00173	KLD: 0.42735
	Epoch: 2	Loss: 10773.4	CE: 3.98606	MSE: 0.94358	KLD: 0.71098
	Epoch: 3	Loss: 9909.4	CE: 3.66427	MSE: 0.92106	KLD: 0.92786
	Epoch: 4	Loss: 9499.4	CE: 3.51189	MSE: 0.90177	KLD: 1.10762
	Epoch: 5	Loss: 9235.8	CE: 3.41405	MSE: 0.88542	KLD: 1.28936


---
The VAE encodes the high-dimensional input data (n_samples + 136 features) in a lower-dimensional space (nlatent features). During training, it learns an encoding scheme, which maps each input to a series of normal distributions, and a decoding scheme, which reconstructs the input from one value sampled from each of those distributions. Because of the sampling, the reconstruction is always made from a latent representation perturbed by Gaussian noise.

The theory here is that the latent representation is a more efficient encoding of the input data. If the input data for the contigs indeed do fall into bins, an efficient encoding would be to simply encode the bin they belong to, then use the "bin identity" to reconstruct the data. We force it to encode to *distributions* rather than single values because this makes it more robust: with intrinsic noise in each encoding, the network cannot as easily overfit by treating slightly different values as very distinct.
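To make the "encode to distributions, then sample" idea concrete, here is a minimal sketch in plain Python. It is not vamb's code: the function name `sample_latent` is made up for illustration, and I'm assuming that `logsigma` holds the logarithm of the standard deviation of each latent distribution.

```python
import math
import random


def sample_latent(mu, logsigma, rng):
    """Draw one value per latent dimension from N(mu, sigma^2).

    Assumption: logsigma is log(sigma), so sigma = exp(logsigma).
    """
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, logsigma)]


rng = random.Random(0)

# Hypothetical encoder output for one contig with three latent dimensions
mu = [-0.63, 0.80, -1.28]
logsigma = [-2.0, -2.0, -2.0]

noisy = sample_latent(mu, logsigma, rng)  # what the decoder sees during training
print(noisy)
print(mu)  # the noise-free centre of each distribution
```

During training the decoder only ever sees values like `noisy`, so two contigs whose `mu` differ by much less than the noise cannot be reliably told apart.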

The loss of the VAE is the sum of three measures:

* Cross entropy (CE) measures the dissimilarity of the reconstructed abundances to observed abundances
* Mean squared error (MSE) measures the dissimilarity of reconstructed versus observed TNF
* Kullback-Leibler divergence (KLD) measures the dissimilarity between the encoded distributions and the standard gaussian distribution N(0, 1)

The latter term is standard in VAEs, and induces the VAE to not crazily overfit by imposing some sensible prior on the kind of encodings it can choose.
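The three terms listed above can be sketched for a single contig in plain Python. This only illustrates the textbook definitions and is not how vamb computes them internally. The function names are mine, and I'm again assuming `logsigma` is the log of the standard deviation.

```python
import math


def cross_entropy(observed, reconstructed):
    """CE between observed and reconstructed abundances (both sum to 1)."""
    return -sum(o * math.log(r) for o, r in zip(observed, reconstructed))


def mean_squared_error(observed, reconstructed):
    """MSE between observed and reconstructed TNF values."""
    return sum((o - r) ** 2 for o, r in zip(observed, reconstructed)) / len(observed)


def kl_divergence(mu, logsigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(2 * ls) - 2 * ls - 1 for m, ls in zip(mu, logsigma)
    )


# An encoding that equals the prior exactly costs nothing ...
print(kl_divergence([0.0, 0.0], [0.0, 0.0]))
# ... while moving the mean away from 0 is penalized
print(kl_divergence([2.0, 0.0], [0.0, 0.0]))
```

The KLD is zero only when the encoded distribution is exactly N(0, 1), so any information the network stores in the latent layer has to pay for itself through lower CE and MSE.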

These terms are weighted with the keyword arguments `errorsum` and `msefraction`. These numbers adjust CE and MSE for a naïve network such that CE+MSE = `errorsum` and MSE/(CE+MSE) = `msefraction`. A naïve network predicts TNF ~ N(0,1) and depth = 1/n_samples for every contig. Such a naïve network has an expected uncorrected CE of log(n_samples)/n_samples and an expected MSE of 2. Broadly speaking, `errorsum` gauges how much the network is allowed to learn - a low value constrains the latent layer to its prior - and `msefraction` gauges how much the network cares about TNF as opposed to depth.
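Here is a small sketch of how such weights could be derived from the two conditions above. It is my reconstruction from the description, not code taken from vamb: `loss_weights` is a made-up name, and I'm assuming the KLD enters the loss with a weight of 1.

```python
import math


def loss_weights(n_samples, errorsum, msefraction):
    """Weights that make a naive network score CE + MSE = errorsum
    with MSE / (CE + MSE) = msefraction."""
    naive_ce = math.log(n_samples) / n_samples  # expected CE of the naive network
    naive_mse = 2.0                             # expected MSE of the naive network
    ce_weight = errorsum * (1 - msefraction) / naive_ce
    mse_weight = errorsum * msefraction / naive_mse
    return ce_weight, mse_weight


# Settings printed in the training log above
ce_weight, mse_weight = loss_weights(n_samples=6, errorsum=1000, msefraction=0.2)

# CE, MSE and KLD as logged for epoch 1
loss = ce_weight * 5.38282 + mse_weight * 1.00173 + 0.42735
print(round(loss, 1))
```

With the numbers from the log, this comes out at about 14520.8, the same as the loss printed for epoch 1, so the sketch at least seems consistent with what the network reports.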

We can see the KL-divergence rises as it learns the dataset and the latent layer drifts away from its prior. At some point, it will begin to overfit too much, and the penalty associated with KL-divergence outweighs the CE and MSE losses. At this point, the KL-divergence will stall, and then fall. This point depends on `errorsum` and the complexity of the dataset.

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [3]:
# No need to pass gpu=True to the encode function to encode on GPU
# If you trained the VAE on GPU, it already resides there
latent = vae.encode(dataloader)

print(latent.shape)

(39551, 40)


---
That's 39551 contigs each represented by the (non-noisy) value of 40 latent neurons.

Now we need to cluster this. That's for the next notebook, so again, I'll save the results.

---

In [4]:
with open('/home/jakni/Downloads/example/latent.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, latent)

---
Alright, let me show how to load the trained VAE given the model file we made above.

I want to **show** that we get the same network back that we trained, so let's try to feed it the same data twice.

---

In [5]:
import torch

# Manually create the first mini-batch without randomization
rpkms_in = torch.Tensor(rpkms[:128]).reshape((128, -1))
tnfs_in = torch.Tensor(tnfs[:128]).reshape((128, -1))

In [6]:
# Calling the VAE as a function encodes and decodes the arguments,
# returning the outputs and the two distribution layers
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu[0])

tensor([-0.6341,  0.8006, -1.2837,  2.2479,  3.0842,  2.0184,  0.3096,
        -0.7320,  1.7008, -0.8898,  2.0501,  0.4636, -0.8683,  0.4024,
        -1.2859,  0.3301,  0.8071, -1.0957,  0.4424, -1.1223, -0.7120,
        -2.1790, -1.9727, -0.8413, -2.6715, -1.0463, -1.6019, -1.8441,
         1.7171,  0.3378, -0.8309,  1.1683,  2.1508, -2.0515, -0.3983,
        -0.4869, -1.5248, -1.6428,  2.3076, -1.4227])


In [8]:
# Now, delete the VAE
del vae

# And reload it:
# We need to manually specify whether it should use GPU or not,
# and whether the network should begin in training or evaluation mode.
vae = vamb.encode.VAE.load('/tmp/model.pt', cuda=False, evaluate=True)
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu[0])

tensor([-0.6341,  0.8006, -1.2837,  2.2479,  3.0842,  2.0184,  0.3096,
        -0.7320,  1.7008, -0.8898,  2.0501,  0.4636, -0.8683,  0.4024,
        -1.2859,  0.3301,  0.8071, -1.0957,  0.4424, -1.1223, -0.7120,
        -2.1790, -1.9727, -0.8413, -2.6715, -1.0463, -1.6019, -1.8441,
         1.7171,  0.3378, -0.8309,  1.1683,  2.1508, -2.0515, -0.3983,
        -0.4869, -1.5248, -1.6428,  2.3076, -1.4227])


---
We get the same values back, meaning the saved network is the same as the loaded network!
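Comparing two printed tensors by eye works, but you can also let PyTorch do the check. The sketch below uses a tiny `torch.nn.Linear` layer as a stand-in for the VAE and plain `torch.save`/`torch.load` on a `state_dict` rather than vamb's own `VAE.save` and `VAE.load`, so take it as an illustration of the round trip and not of vamb's file format.

```python
import io

import torch

torch.manual_seed(0)

# Stand-in for the trained network (NOT the vamb VAE)
model = torch.nn.Linear(4, 2)
model.eval()
batch = torch.rand(3, 4)

with torch.no_grad():
    before = model(batch)

# Save the weights to an in-memory file, then rewind it for reading
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Build a fresh network and load the saved weights into it
reloaded = torch.nn.Linear(4, 2)
reloaded.load_state_dict(torch.load(buffer))
reloaded.eval()

with torch.no_grad():
    after = reloaded(batch)

# True only if every element is identical
print(torch.equal(before, after))
```

For the notebook above, the equivalent check would be to keep a copy of `mu` from before deleting the VAE and pass both tensors to `torch.equal`.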