# Proof of Concept - VAEP 
Variational Autoencoder of the Proteome (VAEP): reconstructing samples on the peptide level using `log`-transformed peptide intensities. This notebook is the proof of concept (POC) for later use.

- Fit VAE to Hela-Sample data (3 samples) and overfit. (Functional test of code)
- Fit 

### Handling Missing values
In this semi-supervised setting, where the samples are both input and target, missing values have to be imputed in the input space, but the imputed values should be excluded from the loss function because their true values are unknown.
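
A minimal sketch of this idea, under assumptions: missing entries are marked as NaN, filled with a placeholder for the forward pass, and excluded from the squared error through a boolean mask. The name `masked_mse` and the fill value are illustrative and not part of `vaep`.

```python
import torch


def masked_mse(recon, target, mask):
    """Mean squared error over observed entries only (mask == True)."""
    squared_error = (recon - target) ** 2
    # Zero out the error at imputed positions, then average over observed ones.
    return (squared_error * mask).sum() / mask.sum()


# Toy batch: two samples, three features, one missing value (NaN).
raw = torch.tensor([[20.0, float("nan"), 18.0],
                    [21.0, 19.0, 17.0]], dtype=torch.float64)
mask = ~torch.isnan(raw)          # True where a value was measured
fill_value = 15.0                 # hypothetical placeholder for the input
x = torch.where(mask, raw, torch.full_like(raw, fill_value))

recon = x + 1.0                   # stand-in for a model output
recon[0, 1] = 100.0               # large error at the imputed position

loss_masked = masked_mse(recon, x, mask)
loss_plain = torch.mean((recon - x) ** 2)
print(loss_masked.item(), loss_plain.item())
```

The masked loss ignores the error at the imputed position, while the plain mean is dominated by it.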

### Alternatives

- [`sklearn.impute.IterativeImputer`](https://scikit-learn.org/stable/modules/impute.html#iterative-imputer)

In [1]:
import pandas as pd

import torch
import torch.utils.data
from torch import nn, optim
from torch.nn import functional as F

import vaep
from vaep.transform import log

## Load Data

In [2]:
import src.file_utils as io_
FOLDER_DATA = 'data'
files = io_.search_files(path=FOLDER_DATA, query='.pkl')
file = io_.check_for_key(files, 'peptides_n4') # ToDo: check for more than one key behaviour
file # sample_peptides.pkl

'data\\sample_peptides_n4.pkl'

In [3]:
peptides = pd.read_pickle(file)
peptides = peptides.apply(log)

In [4]:
REMOVE_MISSING = True
if REMOVE_MISSING:
    mask = peptides.isna().sum() == 0
    peptides = peptides.loc[:,mask]
peptides

Sequence,AAAAAAAAAPAAAATAPTTAATTAATAAQ,AAAAAAALQAK,AAAAAAGAASGLPGPVAQGLK,AAAAAAGAGPEMVR,AAAAAGTATSQRFFQSFSDALIDEDPQAALEELTK,AAAAASRGVGAK,AAAAECDVVMAATEPELLDDQEAK,AAAAVVVPAEWIK,AAAEVAGQFVIK,AAAFEEQENETVVVK,...,YWSQQIEESTTVVTTQSAEVGAAETTLTELRR,YYALCGFGGVLSCGLTHTAVVPLDLVK,YYDQICSIEPK,YYDVMSDEEIER,YYNSDVHR,YYRVCTLAIIDPGDSDIIR,YYTEFPTVLDITAEDPSK,YYTPVPCESATAK,YYTSASGDEMVSLK,YYVTIIDAPGHR
MQ1.6.0.1_20190103_QE8_nLC0_LiNi_QC_MNT_15cm_Hela_01_200327,20.709441,21.698618,18.53392,17.479482,17.200782,18.42168,18.269334,18.525491,18.562701,17.958788,...,17.82066,19.497933,18.332658,17.023568,17.797601,19.440322,18.273062,17.12733,19.731605,19.866894
MQ1.6.1.12_20190103_QE8_nLC0_LiNi_QC_MNT_15cm_Hela_01_200330,20.626997,21.698618,18.53392,17.479482,17.200782,18.42168,18.269334,18.525491,18.395158,17.958788,...,17.82066,19.497933,18.332658,17.023568,17.73037,19.440322,18.273062,17.12733,19.731605,19.866894
MQ1.6.1.12_20190103_QE8_nLC0_LiNi_QC_MNT_15cm_Hela_01_20190104110509_200331,20.922936,21.616247,19.263712,17.635352,18.393957,18.658673,18.543252,18.802195,17.974722,17.683394,...,20.778795,21.700765,18.648215,17.991343,19.239403,21.242119,18.540151,17.254261,20.077269,18.575717
MQ1.6.1.12_20190103_QE8_nLC0_LiNi_QC_MNT_15cm_Hela_02_200331,20.771484,21.870383,18.323883,17.344691,19.156265,18.506767,18.433498,18.377303,18.511435,18.089896,...,20.136207,21.329656,18.422878,17.171745,19.026798,19.965857,18.210873,17.451912,19.815315,18.134424


In [5]:
IMPUTE = False
if IMPUTE:
    from vaep.imputation import imputation_normal_distribution
    imputed = peptides.iloc[:,:10].apply(imputation_normal_distribution)
    imputed    

In [6]:
n_samples, n_features = peptides.shape

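Impute missing values as 0? Here they are instead set to a detection limit, taken as the minimum observed log-intensity rounded down to an integer.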

In [7]:
detection_limit = float(int(peptides.min().min()))
detection_limit 

15.0

In [8]:
peptides.fillna(detection_limit, inplace=True)
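
The effect of the two cells above can be illustrated on a toy frame that still contains a missing value. The column names and numbers are made up for illustration.

```python
import pandas as pd

# Toy log-intensities with one missing value (hypothetical numbers).
toy = pd.DataFrame({"PEP_A": [20.7, 20.6, float("nan")],
                    "PEP_B": [15.9, 17.2, 16.4]})

# min() skips NaN by default; int() truncates towards zero, which
# equals rounding down for positive log-intensities.
limit = float(int(toy.min().min()))
filled = toy.fillna(limit)

print(limit)
print(filled)
```

Every missing entry ends up at the same value below all measured intensities, which also makes imputed entries easy to identify later.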

### Data Loading

In [9]:
dataset_in_memory = peptides.values
dataset_in_memory = torch.from_numpy(dataset_in_memory)
dataset_in_memory

tensor([[20.7094, 21.6986, 18.5339,  ..., 17.1273, 19.7316, 19.8669],
        [20.6270, 21.6986, 18.5339,  ..., 17.1273, 19.7316, 19.8669],
        [20.9229, 21.6162, 19.2637,  ..., 17.2543, 20.0773, 18.5757],
        [20.7715, 21.8704, 18.3239,  ..., 17.4519, 19.8153, 18.1344]],
       dtype=torch.float64)

A dataset needs the methods `__len__` and `__getitem__` so that it can be fed to a `DataLoader`. This means the following has to work:

In [10]:
len(dataset_in_memory)

4

In [11]:
dataset_in_memory[1]

tensor([20.6270, 21.6986, 18.5339,  ..., 17.1273, 19.7316, 19.8669],
       dtype=torch.float64)
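
A plain tensor already satisfies this protocol. If more control is needed later (for example, returning a mask together with each sample), the same two methods can be implemented explicitly. The class below is a hypothetical sketch, not part of `vaep`.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class PeptideDataset(Dataset):
    """Hypothetical map-style dataset: one row of intensities per sample."""

    def __init__(self, values):
        self.values = torch.as_tensor(values, dtype=torch.float64)

    def __len__(self):
        # Number of samples (rows).
        return self.values.shape[0]

    def __getitem__(self, idx):
        # Feature vector of a single sample.
        return self.values[idx]


dataset = PeptideDataset([[20.7, 21.7, 18.5],
                          [20.6, 21.7, 18.5],
                          [20.9, 21.6, 19.3],
                          [20.8, 21.9, 18.3]])
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_shapes = [tuple(batch.shape) for batch in loader]
print(len(dataset), batch_shapes)
```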

## PyTorch Implementation of VAE

### Default Command Line Arguments
- Later, the parameters will be passed to a final program.

In [12]:
from vaep.cmd import parse_args
args = parse_args(['--batch-size', '2', '--no-cuda', '--seed', '43', '--epochs', '300'])
args

Namespace(batch_size=2, cuda=False, epochs=300, log_interval=10, no_cuda=True, seed=43)
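
The source of `vaep.cmd.parse_args` is not shown here. A parser producing a namespace with the same fields could look roughly like the sketch below. The function name and the default values for `--batch-size`, `--epochs` and `--seed` are assumptions; only `log_interval=10` can be read off the output above.

```python
import argparse


def parse_args_sketch(argv=None):
    """Hypothetical stand-in for `vaep.cmd.parse_args`."""
    parser = argparse.ArgumentParser(description="VAE proof of concept")
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--no-cuda", action="store_true", default=False)
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--log-interval", type=int, default=10)
    args = parser.parse_args(argv)
    # The real function presumably also checks whether CUDA is available;
    # that check is omitted to keep the sketch dependency-free.
    args.cuda = not args.no_cuda
    return args


args_sketch = parse_args_sketch(
    ["--batch-size", "2", "--no-cuda", "--seed", "43", "--epochs", "300"])
print(args_sketch)
```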

### Create a DataLoader instance
Passing the in-memory dataset to the `DataLoader` creates an iterable over mini-batches that reshuffles the data at the start of every epoch.

In [13]:
torch.manual_seed(args.seed)

device = torch.device("cuda" if args.cuda else "cpu")

In [14]:
kwargs = {'num_workers': 1, 'pin_memory': True} if args.cuda else {}
train_loader = torch.utils.data.DataLoader(
    dataset=dataset_in_memory,
    batch_size=args.batch_size, shuffle=True, **kwargs)
# test_loader = torch.utils.data.DataLoader(
#     datasets.MNIST('../data', train=False, transform=transforms.ToTensor()),
#     batch_size=args.batch_size, shuffle=True, **kwargs)

Iterate over the data:

In [15]:
for data in train_loader:
    print("Number of samples in mini-batch: {}".format(len(data)),
          "\tObject-Type: {}".format(type(data)))

Number of samples in mini-batch: 2 	Object-Type: <class 'torch.Tensor'>
Number of samples in mini-batch: 2 	Object-Type: <class 'torch.Tensor'>
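
With `shuffle=True`, every pass over the loader visits each sample exactly once, in a new random order. The toy example below makes the order visible by using the sample index as the only feature. Passing a seeded `generator` is an assumption that requires a reasonably recent PyTorch version; the notebook itself relies on `torch.manual_seed`.

```python
import torch
from torch.utils.data import DataLoader

# Four samples with one feature each; the value encodes the sample index.
samples = torch.arange(4, dtype=torch.float64).unsqueeze(1)
generator = torch.Generator().manual_seed(43)
toy_loader = DataLoader(samples, batch_size=2, shuffle=True,
                        generator=generator)

orders = []
for epoch in range(2):
    # Flatten the mini-batches of one epoch into a list of sample indices.
    order = [int(v) for batch in toy_loader for v in batch.flatten()]
    orders.append(order)
    print(epoch, order)
```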


### VAE Model

In [16]:
F.mse_loss?

Signature: F.mse_loss(input, target, size_average=None, reduce=None, reduction='mean')
Docstring:
mse_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor

Measures the element-wise mean squared error.

See :class:`~torch.nn.MSELoss` for details.
File:      c:\users\kzl465\anaconda3\envs\vaep\lib\site-packages\torch\nn\functional.py
Type:      function
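
The `reduction` argument matters for the scale of the loss: `'mean'` divides by the number of elements, whereas `'sum'` (used in `loss_function` further down) grows with the number of features and samples in the batch. A small check with made-up numbers:

```python
import torch
from torch.nn import functional as F

recon = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
target = torch.zeros(2, 2)

# Squared errors are 1, 4, 9 and 16.
loss_mean = F.mse_loss(recon, target, reduction="mean")  # 30 / 4
loss_sum = F.mse_loss(recon, target, reduction="sum")    # 30
print(loss_mean.item(), loss_sum.item())
```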


In [17]:
F.mse_loss??

Signature: F.mse_loss(input, target, size_average=None, reduce=None, reduction='mean')
Source:   
def mse_loss(input, target, size_average=None, reduce=None, reduction='mean'):
    # type: (Tensor, Tensor, Optional[bool], Optional[bool], str) -> Tensor
    r"""mse_loss(input, target, size_average=None, reduce=None, reduction='mean') -> Tensor

    Measures the element-wise mean squared error.

    See :class:`~torch.nn.MSELoss` for details.
    """
    if not ([

In [18]:
from IPython.core.debugger import set_trace
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        
        n_neurons = 1000

        self.fc1 = nn.Linear(n_features, n_neurons)
        self.fc21 = nn.Linear(n_neurons, 50)
        self.fc22 = nn.Linear(n_neurons, 50)
        self.fc3 = nn.Linear(50, n_neurons)
        self.fc4 = nn.Linear(n_neurons, n_features)

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5*logvar)
        eps = torch.randn_like(std)
#         print("std: ", std)
#         print("eps: ", eps)
        return mu + eps*std

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return self.fc4(h3)
#         return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, n_features))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar


model = VAE().to(device)
model.double()
optimizer = optim.Adam(model.parameters(), lr=1e-5)


# Reconstruction + KL divergence losses summed over all elements and batch
def loss_function(recon_x, x, mu, logvar):
    #BCE = F.binary_cross_entropy(recon_x, x.view(-1, n_features), reduction='sum')
    # Squared reconstruction error, summed over all features and samples.
    MSE = F.mse_loss(input=recon_x, target=x.view(-1, n_features), reduction='sum')

    # Alternative (inactive): exclude values imputed at the detection limit.
#     mask = x != detection_limit
#     MSE = torch.sum(((recon_x-x)*mask)**2.0)  / torch.sum(mask)
    # see Appendix B from VAE paper:
    # Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014
    # https://arxiv.org/abs/1312.6114
    # KLD = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Weighted sum of the reconstruction and regularization terms.
    return 0.9*MSE + 0.1*KLD


def train(epoch):
    model.train()
    train_loss = 0
#     for batch_idx, (data, _) in enumerate(train_loader):
    for batch_idx, data in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()
        recon_batch, mu, logvar = model(data)
        print("Mean of mu: ", mu.mean())
        loss = loss_function(recon_batch, data, mu, logvar)
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader),
                loss.item() / len(data)))

    print('====> Epoch: {} Average loss: {:.4f}'.format(
          epoch, train_loss / len(train_loader.dataset)))


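def test(epoch):
    # Requires a `test_loader`, which is not defined in this notebook yet.
    model.eval()
    test_loss = 0
    with torch.no_grad():
        # The loader yields plain tensors (no labels), as in `train`.
        for i, data in enumerate(test_loader):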
            data = data.to(device)
            recon_batch, mu, logvar = model(data)
            test_loss += loss_function(recon_batch, data, mu, logvar).item()
#             if i == 0:
#                 n = min(data.size(0), 8)
#                 comparison = torch.cat([data[:n],
#                                       recon_batch.view(args.batch_size, 1, 28, 28)[:n]])
#                 save_image(comparison.cpu(),
#                          'results/reconstruction_' + str(epoch) + '.png', nrow=n)

    test_loss /= len(test_loader.dataset)
    print('====> Test set loss: {:.4f}'.format(test_loss))
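
The KL term in `loss_function` is the closed form for a diagonal Gaussian posterior $q = \mathcal{N}(\mu, \sigma^2)$ against a standard normal prior $p = \mathcal{N}(0, 1)$:

$$D_{KL}(q \,\|\, p) = -\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

As a sanity check, the sketch below compares this expression with `torch.distributions.kl_divergence` on random inputs. The tensor shapes are arbitrary.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
mu = torch.randn(2, 5, dtype=torch.float64)
logvar = torch.randn(2, 5, dtype=torch.float64)

# Closed form as used in `loss_function`.
kld_closed = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

# Reference value from torch.distributions, summed over all elements.
std = torch.exp(0.5 * logvar)
q = Normal(mu, std)
p = Normal(torch.zeros_like(mu), torch.ones_like(std))
kld_reference = kl_divergence(q, p).sum()

print(torch.allclose(kld_closed, kld_reference))
```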


In [19]:
if __name__ == "__main__":
    for epoch in range(1, args.epochs + 1):
        train(epoch)
#         test(epoch)
#         with torch.no_grad():
#             sample = torch.randn(64, 20).to(device)
#             sample = model.decode(sample).cpu()
#             save_image(sample.view(64, 1, 28, 28),
#                        filename_img_reconst(epoch))

Mean of mu:  tensor(-0.1897, dtype=torch.float64, grad_fn=<MeanBackward0>)
Mean of mu:  tensor(-0.2156, dtype=torch.float64, grad_fn=<MeanBackward0>)
====> Epoch: 1 Average loss: 2982663.4148
Mean of mu:  tensor(-0.2573, dtype=torch.float64, grad_fn=<MeanBackward0>)
Mean of mu:  tensor(-0.2450, dtype=torch.float64, grad_fn=<MeanBackward0>)
====> Epoch: 2 Average loss: 1906956.2212
Mean of mu:  tensor(-0.2525, dtype=torch.float64, grad_fn=<MeanBackward0>)
Mean of mu:  tensor(-0.2877, dtype=torch.float64, grad_fn=<MeanBackward0>)
====> Epoch: 3 Average loss: 1862520.6568
Mean of mu:  tensor(-0.2700, dtype=torch.float64, grad_fn=<MeanBackward0>)
Mean of mu:  tensor(-0.3002, dtype=torch.float64, grad_fn=<MeanBackward0>)
====> Epoch: 4 Average loss: 1857288.3679
Mean of mu:  tensor(-0.3063, dtype=torch.float64, grad_fn=<MeanBackward0>)
Mean of mu:  tensor(-0.2926, dtype=torch.float64, grad_fn=<MeanBackward0>)
====> Epoch: 5 Average loss: 1852205.1566
Mean of mu:  tensor(-0.2971, dtype=torch

In [20]:
for batch_idx, data in enumerate(train_loader):
    data = data.to(device)
    optimizer.zero_grad()
    recon_batch, mu, logvar = model(data)
    break


In [21]:
print(recon_batch)
print(data)
print(mu, logvar)

tensor([[20.5665, 20.8438, 19.1352,  ..., 17.6067, 19.3815, 19.0561],
        [20.4854, 20.8134, 19.1559,  ..., 18.4930, 20.4946, 19.4191]],
       dtype=torch.float64, grad_fn=<AddmmBackward>)
tensor([[20.7715, 21.8704, 18.3239,  ..., 17.4519, 19.8153, 18.1344],
        [20.9229, 21.6162, 19.2637,  ..., 17.2543, 20.0773, 18.5757]],
       dtype=torch.float64)
tensor([[-2.2690e+01, -2.6498e+01, -2.8529e+01,  2.0720e+01,  3.4609e+01,
         -2.2813e+01,  2.3199e+01, -1.3214e+01, -2.9610e+01,  2.5413e+01,
         -2.8270e+01,  1.6048e+01,  2.6857e+01,  3.8466e+01,  6.2390e+00,
         -2.9774e+01,  1.5713e+01, -2.8309e+01,  3.0123e+01, -2.7311e+01,
          2.8372e+01, -3.4503e+01,  9.9211e+00,  2.9404e+01, -2.2419e+01,
         -1.4783e+01,  2.7599e+01,  3.4626e+01, -3.6364e+01, -2.6622e+01,
          2.6771e+01,  2.6931e+01, -2.8425e+01,  3.8790e+00,  2.2791e+01,
          3.5134e+01, -2.9265e+01,  5.6496e-01,  1.3300e+01, -3.1215e+01,
          1.4574e+01, -2.3281e+01, -2.7162e+0