## Step three: Train the autoencoder and encode input data

Again, you can use `help` to see how to use the module:

`>>> help(vamb.encode)`

    Help on module vamb.encode in vamb:

    NAME
        vamb.encode - Encode a depths matrix and a tnf matrix to latent representation.

    DESCRIPTION
        Creates a variational autoencoder in PyTorch and tries to represent the depths
        and tnf in the latent space under gaussian noise.

        usage:
        >>> vae, dataloader = trainvae(depths, tnf) # Make & train VAE on Numpy arrays
        >>> latent = vae.encode(dataloader) # Encode to latent representation
        >>> latent.shape
        (183882, 40)
        
    [ lines elided ]
    
---
Aha, so we need to use the `trainvae` function first, then the `VAE.encode` method. You can call the `help` functions on those, but I'm not showing that here.

Training networks always takes some time. If you have a GPU and CUDA installed, you can pass `cuda=True` to `encode.trainvae` to train on your GPU for increased speed. With a beefy GPU, this can make quite a difference. I'm running this on my laptop, so I'll just use my CPU.
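
If you're not sure whether CUDA is usable on your machine, you can let PyTorch check for you. This is just a minimal sketch: the variable name `use_cuda` is my own, and the commented-out call only forwards it to the `cuda` keyword mentioned above.

```python
# Fall back to the CPU when PyTorch cannot see a CUDA device
# (or when PyTorch cannot be imported at all).
try:
    import torch
    use_cuda = torch.cuda.is_available()
except ImportError:
    use_cuda = False

print("Training on GPU:", use_cuda)

# Hypothetical usage with the arrays loaded in the next cell:
# vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, cuda=use_cuda)
```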

Often, you'll want to reuse a pre-trained VAE. For this, I've added a `VAE.save` method and a `VAE.load` method to the VAE class. In this example, I'll ask `trainvae` to write the trained model weights to a file in `/tmp` and show how to reload the VAE again. But remember - a trained VAE only works on the dataset it was trained on!

In [1]:
# Again, we import stuff
import sys
sys.path.append('/home/jakni/Documents/scripts/')
import vamb

# And load the data we just saved - of course, if this wasn't in different
# notebooks, we could have just kept it in memory
with open('/home/jakni/Downloads/example/rpkms.npz', 'rb') as file:
    rpkms = vamb.vambtools.read_npz(file)
    
with open('/home/jakni/Downloads/example/tnfs.npz', 'rb') as file:
    tnfs = vamb.vambtools.read_npz(file)

In [28]:
# I'm training just 5 epochs for this demonstration.
# When actually using the VAE, 200-300 epochs are suitable
with open('/tmp/model', 'wb') as modelfile:
    vae, dataloader = vamb.encode.trainvae(rpkms, tnfs, nepochs=5, verbose=True, modelfile=modelfile)

CE factor:  60.27594766753471
MSE factor:  1.0
CUDA: False
N latent:  40
N hidden:  325, 325
N contigs:  39551
N samples:  6
Time is:  2018-07-27 14:57:03.757052
Epoch: 1	Loss: 13.6593	CE: 0.20374	MSE: 0.92556	KLD: 0.45293
Epoch: 2	Loss: 9.0567	CE: 0.12438	MSE: 0.87441	KLD: 0.68541
Epoch: 3	Loss: 8.0029	CE: 0.10890	MSE: 0.79769	KLD: 0.64135
Epoch: 4	Loss: 7.5572	CE: 0.10388	MSE: 0.70308	KLD: 0.59264
Epoch: 5	Loss: 7.3596	CE: 0.10170	MSE: 0.66427	KLD: 0.56503


---
The VAE encodes the high-dimensional (n_samples + 136 features) input data in a lower-dimensional space (nlatent features). During training, it learns the encoding scheme and, at the same time, attempts to reconstruct the input data from the latent representation after Gaussian noise has been added to it.

The theory here is that the latent representation should be a more efficient encoding of the input data. If the input data for the contigs do indeed fall into bins, an efficient encoding would be to simply encode the bin each contig belongs to, then use that "bin identity" to reconstruct the data. We add noise to prevent the network from learning a huge number of slightly different bins; in the most extreme case, each bin would contain only a single contig.
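
To make the role of the noise a bit more concrete, here is a rough NumPy sketch of how a noisy latent sample could be drawn from the encoder's outputs. This is not vamb's own code: the function name `sample_latent` is made up, and I'm assuming that `logsigma` holds the log of the variance, which nothing in this notebook confirms.

```python
import numpy as np

def sample_latent(mu, logsigma, rng):
    """Draw one noisy latent vector per contig (reparameterization trick).

    Assumption: logsigma is the log-variance, so the standard deviation
    is exp(logsigma / 2).
    """
    epsilon = rng.standard_normal(mu.shape)  # noise ~ N(0, 1)
    return mu + epsilon * np.exp(logsigma / 2)

rng = np.random.default_rng(0)
mu = np.zeros((3, 40))             # 3 contigs, 40 latent neurons
logsigma = np.full((3, 40), -2.0)  # same (small) variance everywhere
noisy = sample_latent(mu, logsigma, rng)
print(noisy.shape)
```

Under this sketch, two contigs whose `mu` values lie closer together than the noise level give overlapping samples, so the decoder cannot reliably tell them apart. That is the pressure against learning many near-identical bins.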

The loss of the VAE is the sum of three measures:

* Cross entropy (CE) measures the dissimilarity of the reconstructed abundances to observed abundances
* Mean squared error (MSE) measures the dissimilarity of reconstructed versus observed TNF
* Kullback-Leibler divergence (KLD) measures the dissimilarity between the standard normal distribution and the distribution of encoded values with noise added

At least in principle, the last term induces the VAE not to overfit wildly, by imposing a sensible prior on the kind of encodings it can choose.
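
The three measures can be sketched in a few lines of NumPy. Treat this as an illustration under my own assumptions rather than as vamb's implementation: the function name `vae_loss_terms` is hypothetical, each term is averaged over all elements, and `logsigma` is taken to be the log-variance.

```python
import numpy as np

def vae_loss_terms(depths_in, depths_out, tnf_in, tnf_out, mu, logsigma):
    """Return unweighted (CE, MSE, KLD) for a batch of contigs."""
    # Cross entropy between observed and reconstructed abundances.
    # The small constant avoids log(0).
    ce = -np.mean(depths_in * np.log(depths_out + 1e-9))
    # Mean squared error between observed and reconstructed TNF
    mse = np.mean((tnf_out - tnf_in) ** 2)
    # KL divergence between N(mu, exp(logsigma)) and N(0, 1),
    # assuming logsigma is the log-variance
    kld = -0.5 * np.mean(1 + logsigma - mu ** 2 - np.exp(logsigma))
    return ce, mse, kld

# One contig, 6 samples: a uniform prediction of 1/6 for every sample
depths_in = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0]])
depths_out = np.full((1, 6), 1 / 6)
tnf = np.zeros((1, 136))
mu = np.zeros((1, 40))
logsigma = np.zeros((1, 40))

ce, mse, kld = vae_loss_terms(depths_in, depths_out, tnf, tnf, mu, logsigma)
print(ce, mse, kld)
```

In this toy case the TNF is reconstructed perfectly and the encoding sits exactly on the prior, so only the CE term is non-zero.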

These terms are weighted using the keyword arguments `errorsum` and `msefraction`. These numbers scale CE and MSE such that, for a naïve network, CE + MSE = `errorsum` and MSE / (CE + MSE) = `msefraction`. A naïve network predicts TNF ~ N(0,1) and depth = 1/n_samples for every contig. Such a naïve network has an expected uncorrected CE of log(n_samples)/n_samples and an expected MSE of 2. Broadly speaking, `errorsum` gauges how much the network is allowed to learn - a low value constrains the latent layer to its prior - and `msefraction` gauges how much the network cares about TNF as opposed to depth.
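
As a worked example, the two factors printed at the top of the training log can be recomputed from the naïve expectations. The values `errorsum = 20` and `msefraction = 0.1` below are my inference, chosen because they reproduce the logged factors for 6 samples; I have not checked them against the actual defaults. The formula for the total loss is likewise an assumption that happens to match the first logged epoch.

```python
import math

n_samples = 6
errorsum = 20.0     # assumed: inferred from the log, not a confirmed default
msefraction = 0.1   # assumed: inferred from the log, not a confirmed default

# Expected errors of the naive network described above
naive_ce = math.log(n_samples) / n_samples
naive_mse = 2.0

# Scale the terms so that, for the naive network,
# CE + MSE = errorsum and MSE / (CE + MSE) = msefraction
ce_factor = errorsum * (1 - msefraction) / naive_ce
mse_factor = errorsum * msefraction / naive_mse
print("CE factor: ", ce_factor)
print("MSE factor: ", mse_factor)

# Epoch 1 of the log: CE, MSE and KLD as printed (rounded)
ce, mse, kld = 0.20374, 0.92556, 0.45293
loss = ce_factor * ce + mse_factor * mse + kld
print("Loss:", round(loss, 4))
```

The recomputed loss agrees with the logged value of 13.6593 to about three decimals; the small difference comes from the rounding of the printed CE, MSE and KLD.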

We can see that the KL divergence rises right at the beginning, as the network learns the dataset and the latent layer drifts away from its prior. At epoch 3, the penalty associated with the KL divergence outweighs the CE and MSE losses, and the KL divergence falls.

Okay, so now we have the trained `vae` and the `dataloader`. Let's feed the dataloader to the VAE in order to get the latent representation:

---

In [29]:
# No need to pass gpu=True to the encode function to encode on GPU
# If you trained the VAE on GPU, it already resides there
latent = vae.encode(dataloader)

print(latent.shape)

(39551, 40)


---
That's 39551 contigs each represented by the (non-noisy) value of 40 latent neurons.

Now we need to cluster this. That's for the next notebook, so again, I'll save the results.

---

In [5]:
with open('/home/jakni/Downloads/example/latent.npz', 'wb') as file:
    vamb.vambtools.write_npz(file, latent)

---
Alright, let me show how to load the trained VAE given the model file we made above.

I want to **show** that we get the same network back that we trained, so let's try to feed it the same data twice.

---

In [30]:
import torch

rpkms_in = torch.Tensor(rpkms[:100]).reshape((100, -1))
tnfs_in = torch.Tensor(tnfs[:100]).reshape((100, -1))

In [37]:
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu[0])

tensor([-0.2797, -0.5739, -0.8438, -1.7344, -0.5020,  1.3880, -0.1967,
        -1.4400, -1.2383,  1.2905,  1.4744,  0.1630,  1.3457, -0.3916,
         0.7508, -2.6650, -1.3275,  0.0155,  1.1138,  1.0715,  0.4415,
        -1.8217, -1.0472, -0.4817, -0.2983,  2.2535,  1.5923,  1.2273,
        -0.8661,  0.7274, -0.7957, -1.0649, -0.0674, -0.4920, -0.4447,
         0.3880,  1.2463,  2.1039,  0.9956,  0.2469])


In [38]:
# Now, delete the VAE
del vae

# And reload it:
# Annoyingly, PyTorch only works with paths, not with filehandles.
# We need to manually specify whether it should use GPU or not
# And whether the network should begin in training or evaluation mode.
# Also, we need to specify the errorsum and mseratio - we used the defaults,
# so I skip that here.
vae = vamb.encode.VAE.load('/tmp/model', cuda=False, evaluate=True)
depths_out, tnf_out, mu, logsigma = vae(rpkms_in, tnfs_in)
print(mu[0])

tensor([-0.2797, -0.5739, -0.8438, -1.7344, -0.5020,  1.3880, -0.1967,
        -1.4400, -1.2383,  1.2905,  1.4744,  0.1630,  1.3457, -0.3916,
         0.7508, -2.6650, -1.3275,  0.0155,  1.1138,  1.0715,  0.4415,
        -1.8217, -1.0472, -0.4817, -0.2983,  2.2535,  1.5923,  1.2273,
        -0.8661,  0.7274, -0.7957, -1.0649, -0.0674, -0.4920, -0.4447,
         0.3880,  1.2463,  2.1039,  0.9956,  0.2469])
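
Comparing forty numbers by eye gets old quickly, so here is a small sketch of doing the comparison programmatically. It uses NumPy and the first five values printed above as stand-ins. The names `mu_before` and `mu_after` are made up for the two `mu` outputs; with the real tensors, `torch.allclose` should do the same job.

```python
import numpy as np

# First five latent means of contig 0, copied from the two printouts above
mu_before = np.array([-0.2797, -0.5739, -0.8438, -1.7344, -0.5020])
mu_after = np.array([-0.2797, -0.5739, -0.8438, -1.7344, -0.5020])

# allclose tolerates tiny floating point differences
same = np.allclose(mu_before, mu_after)
print("Reloaded VAE gives the same encoding:", same)
```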
