## Step three: Train the autoencoder and encode input data

Again, you can use `help` to see how to use the module.

`>>> help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call the `help` functions on those, but I'm not showing that here.

Training a network always takes some time. If you have a GPU and CUDA installed, you can pass `cuda=True` to `encode.trainvae` to train on your GPU for increased speed. With a beefy GPU, this can make quite a difference. I run this on my laptop, so I'll just use my CPU.

Often, you'll want to reuse a pre-trained VAE. For this, I've added the `VAE.save` method of the VAE class, as well as a `VAE.load` method. In this example, I'll ask to write the trained model weights to a file in `/tmp` and show how to reload the VAE again.

In [1]:
# Again, we import stuff
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

# And load the data we just saved - of course, if this wasn't in different
# notebooks, we could have just kept it in memory
with open('/home/jakni/Downloads/example/rpkms.npz', 'rb') as file:
    rpkms = vamb.vambtools.read_npz(file)
    
with open('/home/jakni/Downloads/example/tnfs.npz', 'rb') as file:
    tnfs = vamb.vambtools.read_npz(file)

In [10]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
with open('/tmp/model', 'wb') as modelfile:
    vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, verbose=True, modelfile=modelfile)

Epoch: 1	Loss: 6.1650	BCE: 2.0077	CE: 0.0020501	MSE: 0.00531	KLD: 0.00940
Epoch: 2	Loss: 3.6706	BCE: 1.3009	CE: 0.0012164	MSE: 0.01227	KLD: 0.00911
Epoch: 3	Loss: 3.1486	BCE: 1.1498	CE: 0.0010405	MSE: 0.01821	KLD: 0.00891
Epoch: 4	Loss: 2.9865	BCE: 1.1019	CE: 0.0009851	MSE: 0.02245	KLD: 0.00865
Epoch: 5	Loss: 2.9117	BCE: 1.0800	CE: 0.0009592	MSE: 0.02556	KLD: 0.00851


---
The VAE encodes the high-dimensional (n_samples + 136 features) input data in a lower-dimensional space (nlatent features). When training, it both learns the encoding scheme and attempts to reconstruct the input data from the latent representation after Gaussian noise has been added to it.

The theory here is that the latent representation should be a more efficient encoding of the input data. If the input data for the contigs indeed do fall into bins, an efficient encoding should be to simply encode the bin they belong to, then use the "bin identity" to reconstruct the data. We add noise to prevent it from learning a huge number of slightly different bins. In the most extreme case, each bin would contain only one contig.
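To make the noise step concrete, here's a minimal sketch in plain Python. It is not vamb's actual code: the function name `sample_latent` and all the numbers are made up for illustration, and I'm assuming the network's `logsigma` output is the log of the standard deviation (the real implementation may parameterize it differently).

```python
import math
import random

def sample_latent(mu, logsigma, rng):
    """Return mu + sigma * eps elementwise, with eps drawn from N(0, 1).

    Assumption: logsigma holds log(sigma) for each latent dimension.
    """
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, logsigma)]

rng = random.Random(0)          # seeded so the sketch is reproducible
mu = [1.5, -0.7, 0.2]           # hypothetical latent means for one contig
logsigma = [-2.0, -2.0, -2.0]   # hypothetical log standard deviations

# During training, the decoder would see a noisy version of mu
noisy = sample_latent(mu, logsigma, rng)

# As sigma shrinks towards zero, the sample collapses onto mu itself
noiseless = sample_latent(mu, [-50.0, -50.0, -50.0], rng)

print(noisy)
print(noiseless)
```

The point is just that two contigs whose means sit closer together than the noise scale become hard to tell apart after sampling, which discourages the network from carving the latent space into lots of tiny bins.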

The loss of the VAE is the sum of three measures:

* Cross entropy (CE) measures the dissimilarity of the reconstructed abundances to observed abundances
* Mean squared error (MSE) measures the dissimilarity of reconstructed versus observed TNF
* Kullback-Leibler divergence (KLD) measures the dissimilarity between the standard normal distribution and the distribution of encoded values with noise added

At least in principle, the last term induces the VAE to not crazily overfit by imposing some sensible prior on the kind of encodings it can choose.
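For a Gaussian encoding, the KLD against the standard normal has a simple closed form. Here's a small sketch of it in plain Python. Again, this is an illustration rather than vamb's code: `kld_to_standard_normal` is a name I made up, I'm assuming `logsigma` means log of the standard deviation, and I'm using the usual VAE direction KL(encoding || prior). The real implementation may also scale or average the term differently.

```python
import math

def kld_to_standard_normal(mu, logsigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    Assumption: logsigma holds log(sigma) for each dimension.
    Per dimension: 0.5 * (mu^2 + sigma^2 - 1) - log(sigma)
    """
    total = 0.0
    for m, ls in zip(mu, logsigma):
        var = math.exp(2.0 * ls)  # sigma^2
        total += 0.5 * (m * m + var - 1.0) - ls
    return total

# An encoding that already matches the prior costs nothing
at_prior = kld_to_standard_normal([0.0, 0.0], [0.0, 0.0])

# Moving the mean away from zero is penalized
shifted = kld_to_standard_normal([1.0], [0.0])

# So is shrinking the noise towards zero
narrow = kld_to_standard_normal([0.0], [-3.0])

print(at_prior, shifted, narrow)
```

So the penalty grows both when the means wander far from zero and when the network tries to switch the noise off, which are exactly the two ways it could memorize individual contigs.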

CE is weighted by a factor of 3000 - this constant is pretty ad-hoc and is just there to make sure that the VAE does not prioritize MSE, which is high but fairly uninformative, over CE, which is low but highly informative.
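You can check this weighting against the training log above. The sketch below copies the five logged lines and recomputes the loss as 3000 * CE + MSE + KLD. The exact formula is my assumption, inferred from the text and the printed numbers rather than read off the source code. The log is rounded, so I only look at the difference between the reported and recomputed values.

```python
CE_WEIGHT = 3000  # weighting factor stated in the text

# (reported loss, CE, MSE, KLD) for epochs 1-5, copied from the log above
epochs = [
    (6.1650, 0.0020501, 0.00531, 0.00940),
    (3.6706, 0.0012164, 0.01227, 0.00911),
    (3.1486, 0.0010405, 0.01821, 0.00891),
    (2.9865, 0.0009851, 0.02245, 0.00865),
    (2.9117, 0.0009592, 0.02556, 0.00851),
]

diffs = []
for number, (loss, ce, mse, kld) in enumerate(epochs, start=1):
    recomputed = CE_WEIGHT * ce + mse + kld
    diffs.append(abs(recomputed - loss))
    print(f"Epoch {number}: reported {loss:.4f}, recomputed {recomputed:.4f}")

max_diff = max(diffs)
print(f"Largest difference: {max_diff:.5f}")
```

With these numbers the weighted CE makes up nearly all of the total, so the rising MSE barely registers in the overall loss.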

We can see the mean squared error (which is the TNF-related loss) is rising these first 5 epochs, presumably as the VAE sacrifices an efficient representation of the TNF in order to learn the depths (whose loss is CE) better. This happens sometimes, and it's alright - after all, co-abundance usually contains more information than TNF, and so we have weighted the CE several orders of magnitude more heavily than the MSE in order for the VAE to be able to make this choice.

Do note that binary cross entropy (BCE) is also reported. This is another common measure for dissimilarity of reconstruction. It is not used in the loss, just displayed for reference.

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [3]:
# No need to pass gpu=True to the encode function to encode on GPU
# If you trained the VAE on GPU, it already resides there
latent = vae.encode(dataloader)

print(latent.shape)

(39551, 40)


---
That's 39551 contigs, each represented by the (non-noisy) values of 40 latent neurons.

Now we need to cluster this. That's for the next notebook, so again, I'll save the results.

---

In [5]:
with open('/home/jakni/Downloads/example/latent.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, latent)

---
Alright, let me show how to load the trained VAE given the model file we made above.

I want to **show** that we get the same network back that we trained, so let's try to feed it the same data twice.

---

In [39]:
import torch

rpkms_in = torch.Tensor(rpkms[0]).reshape((1, -1))
tnfs_in = torch.Tensor(tnfs[0]).reshape((1, -1))

In [40]:
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu)

tensor([[ 4.0755, -4.0628,  0.7696,  2.6850,  2.8026,  1.4880, -0.9003,
         -3.3665, -2.1976,  4.9359, -4.4380,  2.6925,  4.7105,  1.4558,
          1.7984, -1.8792, -3.6274, -0.4964,  2.9361, -0.8234, -3.7444,
         -5.0871, -1.1772,  0.2317, -2.1899, -0.4474,  2.8878,  3.1479,
          1.8702, -3.7109,  4.4921, -0.4929, -2.2598, -3.8289, -4.3172,
          1.1155,  2.9243,  0.3677,  0.8327,  1.7788]])


In [43]:
# Now, delete the VAE
del vae

# And reload it:
# Annoyingly, PyTorch only works with paths, not with filehandles.
# We need to manually specify whether it should use GPU or not
# And whether the network should begin in training or evaluation mode
vae = vamb.encode.VAE.load('/tmp/model', cuda=False, evaluate=True)
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu)

tensor([[ 4.0755, -4.0628,  0.7696,  2.6850,  2.8026,  1.4880, -0.9003,
         -3.3665, -2.1976,  4.9359, -4.4380,  2.6925,  4.7105,  1.4558,
          1.7984, -1.8792, -3.6274, -0.4964,  2.9361, -0.8234, -3.7444,
         -5.0871, -1.1772,  0.2317, -2.1899, -0.4474,  2.8878,  3.1479,
          1.8702, -3.7109,  4.4921, -0.4929, -2.2598, -3.8289, -4.3172,
          1.1155,  2.9243,  0.3677,  0.8327,  1.7788]])
