# Retention Time Prediction 

This notebook is prepared to run in Google [Colaboratory](https://colab.research.google.com/). To train the model faster, change the Colab runtime to use a hardware accelerator, either GPU or TPU.

This notebook presents a short walkthrough of the process of reading a dataset and training a model for retention time prediction. The dataset is an example dataset extracted from a ProteomeTools dataset generated in the **Chair of Bioanalytics** at the **School of Life Sciences** at the **Technical University of Munich**.

The framework used is a custom wrapper on top of Keras/TensorFlow. The working name of the package is currently DLOmix (`dlomix`).

This notebook illustrates briefly how to integrate [Weights and Biases](https://wandb.ai/) to track your experiments.

In [1]:
# install the DLOmix package in the current environment using pip

!python -m pip install git+https://github.com/wilhelm-lab/dlomix

Collecting git+https://github.com/wilhelm-lab/dlomix
  Cloning https://github.com/wilhelm-lab/dlomix to c:\users\micro\appdata\local\temp\pip-req-build-eoh_r4y1
  Resolved https://github.com/wilhelm-lab/dlomix to commit d256a271c9ee411855beb7cf7065b93dbcb11fcf
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting fpdf
  Downloading fpdf-1.7.2.tar.gz (39 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: dlomix, fpdf
  Building wheel for dlomix (pyproject.toml): started
  Building wheel for dlomix (pyproject.toml): finished with status 'done'
  Created wheel for dlomix: filename=dlomix-0.0.4-py3-none-any.whl siz

  Running command git clone --filter=blob:none --quiet https://github.com/wilhelm-lab/dlomix 'C:\Users\micro\AppData\Local\Temp\pip-req-build-eoh_r4y1'



In [None]:
# install wandb via pip
!python -m pip install wandb

The available modules in the framework are as follows:

In [None]:
import numpy as np
import pandas as pd
import dlomix
from dlomix import constants, data, eval, layers, models, pipelines, reports, utils
print([x for x in dir(dlomix) if not x.startswith("_")])

- `constants`: constants to be used in the framework (e.g. amino acid alphabet mapping)
- `data`: classes for representing datasets, wrappers around TensorFlow Dataset
- `eval`: custom evaluation metrics implemented in Keras/TF to work as `metrics` for model training
- `layers`: custom layer implementation required for the different models
- `models`: different model implementations for Retention Time Prediction
- `pipelines`: complete pipelines to run a task (e.g. Retention Time prediction)
- `reports`: classes for producing plots and reports to evaluate models (e.g. `RetentionTimeReport`)
- `utils`: helper modules

**Note**: `reports` and `pipelines` are work in progress; some functionalities are not complete.

## 0. Import and Initialize Weights and Biases

In [None]:
# import wandb and the Keras Callback

import wandb
from wandb.keras import WandbCallback

In [None]:
# enter project name
project_name = 'retention time sample run'
wandb.init(project=project_name)

## 1. Load Data

We can import the dataset class and create an object of type `RetentionTimeDataset`. This object wraps around TensorFlow dataset objects, either for training and validation or for testing. Which of the two is created is controlled by the arguments `val_ratio` and `test`.

In [None]:
from dlomix.data import RetentionTimeDataset

In [None]:
TRAIN_DATAPATH = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix/develop/example_dataset/proteomTools_train_val.csv'
BATCH_SIZE = 64

rtdata = RetentionTimeDataset(data_source=TRAIN_DATAPATH,
                              seq_length=30, batch_size=BATCH_SIZE, val_ratio=0.2, test=False)
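To make the role of `val_ratio` concrete, here is a minimal sketch of a ratio-based split, assuming a simple shuffle-and-slice strategy. The function `split_by_ratio` and its `seed` argument are hypothetical and exist only for illustration; how `RetentionTimeDataset` splits the data internally may differ.

```python
import random


def split_by_ratio(examples, val_ratio=0.2, seed=0):
    """Shuffle a copy of `examples` and hold out a `val_ratio` fraction for validation."""
    if not 0.0 <= val_ratio < 1.0:
        raise ValueError("val_ratio must be in [0, 1)")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    # the first n_val shuffled items form the validation split, the rest the training split
    return shuffled[n_val:], shuffled[:n_val]


train, val = split_by_ratio(range(10), val_ratio=0.2)
print(len(train), len(val))  # 8 2
```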

Now we have an RT dataset that can be used directly with standard or custom `Keras` models. This wrapper contains the splits we chose when creating it. In our case, they are training and validation splits. To get the TF Dataset, we access the attributes `.train_data` and `.val_data`. The length is now in batches (i.e. `total examples = batch_size x len`).
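Because the last batch may be smaller than `batch_size`, the product `batch_size x len` is an upper bound on the number of examples rather than an exact count. A minimal sketch of the arithmetic, assuming the final partial batch is kept (the `tf.data` default, `drop_remainder=False`); the helper `num_batches` and the numbers are hypothetical:

```python
import math


def num_batches(n_examples, batch_size):
    """Number of batches when the final, possibly smaller, batch is kept."""
    return math.ceil(n_examples / batch_size)


n_examples, batch_size = 1000, 64  # illustrative numbers, not the real dataset size
n = num_batches(n_examples, batch_size)
print(n, batch_size * n)  # 16 1024
```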

In [None]:
"Training examples", BATCH_SIZE * len(rtdata.train_data)

In [None]:
"Validation examples", BATCH_SIZE * len(rtdata.val_data)

In [None]:
# if needed, add config params to wandb.config

config = wandb.config

config.seq_length = 30
config.batch_size = BATCH_SIZE
config.val_ratio = 0.2

## 2. Model

We can now create the model. We will use a simple predictor with a Conv1D encoder. It comes with working default arguments, but most of the parameters can be customized.

**Note**: It is important to ensure that the padding length used for the dataset object is equal to the sequence length passed to the model.

In [None]:
from dlomix.models import RetentionTimePredictor

In [None]:
model = RetentionTimePredictor(seq_length=30)
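The requirement that the dataset padding length and the model's `seq_length` agree can be illustrated with a toy encoder. This is a sketch, not DLOmix's implementation: the alphabet mapping, the padding index 0, the choice to reject longer sequences, and the function `encode_and_pad` are all assumptions made for illustration.

```python
# Toy alphabet (assumption): the 20 standard amino acids mapped to 1..20; 0 is reserved for padding.
ALPHABET = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY", start=1)}


def encode_and_pad(sequence, seq_length=30):
    """Integer-encode a peptide sequence and right-pad it with zeros to `seq_length`."""
    if len(sequence) > seq_length:
        raise ValueError(f"sequence longer than seq_length={seq_length}")
    encoded = [ALPHABET[aa] for aa in sequence]
    return encoded + [0] * (seq_length - len(encoded))


vector = encode_and_pad("PEPTIDE", seq_length=30)
print(len(vector))  # 30
print(vector[:8])   # [13, 4, 13, 17, 8, 3, 4, 0]
```

Every encoded vector has exactly `seq_length` entries, which is why a model built with a different `seq_length` would receive inputs of the wrong shape.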

## 3. Training

We can then train the model like a standard Keras model. The training parameters here are from Prosit, but other optimizer parameters can be used.  

In [None]:
#imports

from dlomix.eval import TimeDeltaMetric

In [None]:
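# compile the model with the optimizer and the metrics we want to use; we can add our custom timedelta metric

# build the optimizer object explicitly so that its learning rate matches the value logged in config.lr
# (the string 'adam' would use the Keras default learning rate of 0.001)
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

model.compile(optimizer=optimizer,
              loss='mse',
              metrics=['mean_absolute_error', TimeDeltaMetric()])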

In [None]:
# add more parameters to the config as needed

config.lr = 0.0001
config.optimizer = "adam"

We store the result of training so that we can explore the metrics and the losses later. We specify the number of epochs for training and pass the training and validation data as previously described.

At this point in a script or a notebook, the `WandbCallback` is passed to `model.fit()` or to similar functions that accept callbacks (e.g. the legacy `model.fit_generator()`).

Note that the warning printed during training is due to the choice of save format for the model. The arguments for `WandbCallback()` can be set as needed; its documentation is available here: https://docs.wandb.ai/ref/python/integrations/keras/wandbcallback

In [None]:
# here we pass the WandbCallback to model.fit

history = model.fit(rtdata.train_data,
                    validation_data=rtdata.val_data,
                    epochs=5,
                    callbacks=[WandbCallback()])
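The object returned by `model.fit()` exposes a `history` attribute, a dictionary that maps each metric name to a list with one value per epoch. A small sketch of how such a dictionary can be inspected; the numbers are made up for illustration and the helper `best_epoch` is hypothetical:

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) at which `metric` is lowest."""
    values = history_dict[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Stand-in for `history.history`; illustrative values only, not real training results.
example_history = {
    "loss": [0.90, 0.55, 0.40, 0.33, 0.30],
    "val_loss": [0.80, 0.52, 0.45, 0.47, 0.50],
}
print(best_epoch(example_history))  # (3, 0.45)
```

With a real training run, the same helper could be called as `best_epoch(history.history)`.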

## 4. Testing and Reporting

We can create a test dataset to test our model. Additionally, we can use the reporting module to produce plots and evaluate the model.

**Note**: the reporting module is still a work in progress, and some functionality might break.

In [None]:
# create the dataset object for test data

TEST_DATAPATH = 'https://raw.githubusercontent.com/wilhelm-lab/dlomix/develop/example_dataset/proteomTools_test.csv'

test_rtdata = RetentionTimeDataset(data_source=TEST_DATAPATH,
                                   seq_length=30, batch_size=32, test=True)

In [None]:
# use model.predict from Keras directly on the test data

predictions = model.predict(test_rtdata.test_data)

# we use ravel from numpy to flatten the array (since it comes out as an array of arrays)
predictions = predictions.ravel()
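Flattening matters when the predictions are later compared element-wise with the targets. The sketch below uses stand-in arrays and assumes the model output has shape `(n_examples, 1)` while the targets are a flat vector; the values are illustrative only.

```python
import numpy as np

# Stand-in for the output of model.predict: one column, one row per example.
raw = np.array([[10.5], [20.1], [30.7]])
targets = np.array([10.0, 20.0, 30.0])

flat = raw.ravel()
print(raw.shape, flat.shape)   # (3, 1) (3,)

# Without flattening, broadcasting silently produces a 3x3 matrix instead of 3 residuals.
print((raw - targets).shape)   # (3, 3)
print((flat - targets).shape)  # (3,)
```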

In [None]:
# we can get the targets of a specific split to calculate evaluation metrics against predictions
# the get_split_targets function of RetentionTimeDataset does this

test_targets = test_rtdata.get_split_targets(split="test")

In [None]:
test_targets, predictions

In [None]:
from dlomix.reports import RetentionTimeReport

In [None]:
# create a report object by passing the history object and plot different metrics
report = RetentionTimeReport(output_path="./output", history=history)

In [None]:
report.plot_keras_metric("loss")

In [None]:
report.plot_keras_metric("mean_absolute_error")

In [None]:
report.plot_keras_metric("timedelta")

In [None]:
# calculate R2 given the targets and the predictions of the test data
report.calculate_r2(test_targets, predictions)
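For reference, the coefficient of determination in its usual form is `R2 = 1 - SS_res / SS_tot`. The sketch below implements that textbook definition in plain Python. It is an assumption that `calculate_r2` follows the same definition; the function `r_squared` and the sample values are hypothetical.

```python
def r_squared(targets, predictions):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    mean_target = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # 1.0
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0
```

Perfect predictions give 1.0, while always predicting the mean of the targets gives 0.0.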

In [None]:
report.plot_density(test_targets, predictions)

In [None]:
report.plot_residuals(test_targets, predictions, xrange=(-30, 30))

We can also produce a complete report with all the relevant plots in one PDF file by calling the `generate_report` function.

In [None]:
report.generate_report(test_targets, predictions)

## 5. Saving and Loading Models

Models can be saved the same way any Keras model would be saved. It is better to save the weights rather than the full model, since this makes loading the model again easier and more platform-independent. The extra step needed is to create a model object and then load the weights into it.

In [None]:
# save the model weights

save_path = "./output/rtmodel"
model.save_weights(save_path)

In [None]:
# models can be later loaded by creating a model object and then loading the weights

trained_model = RetentionTimePredictor(seq_length=30)
trained_model.load_weights(save_path)

We can compare the predictions to make sure that the model was loaded correctly.

In [None]:
new_predictions = trained_model.predict(test_rtdata.test_data)

In [None]:
new_predictions = new_predictions.ravel()

In [None]:
# confirm all old and new predictions are the same
np.allclose(predictions, new_predictions)
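`np.allclose` compares within a tolerance rather than exactly, which is appropriate for floating-point outputs. The sketch below uses stand-in arrays (illustrative values) to show the difference from exact equality and how to inspect the largest deviation if the check fails:

```python
import numpy as np

# Stand-ins for the two prediction vectors; the second carries a tiny perturbation.
old = np.array([12.30, 45.60, 78.90])
new = old + 1e-9

print(np.array_equal(old, new))         # False
print(np.allclose(old, new))            # True
print(np.abs(old - new).max() < 1e-8)   # True
```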

In [None]:
results_df = pd.DataFrame({"sequence": test_rtdata.sequences,
                           "irt": test_rtdata.targets,
                           "predicted_irt": predictions})

results_df.to_csv("./output/predictions_irt.csv", index=False)

In [None]:
pd.read_csv("./output/predictions_irt.csv")