# Experiment 02


The current approach for creating the training, validation and test data sets is to split
the data in long format into the desired splits. In long format, one observation is a single intensity value of one peptide in one sample. Missing values are not considered in this process.
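
As a minimal sketch of the two layouts, the following uses a small hypothetical table (the sample and peptide names are made up for illustration). Stacking a wide table gives the long format, and the missing values drop out along the way.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are samples, columns are peptides.
wide = pd.DataFrame(
    {"pepA": [27.8, np.nan, 28.7], "pepB": [29.1, 27.6, np.nan]},
    index=pd.Index(["s1", "s2", "s3"], name="Sample ID"),
)
wide.columns.name = "peptide"

# Long format: one row per observed (sample, peptide) intensity.
# Missing values are dropped explicitly, so no NaN entries remain.
long_df = wide.stack().dropna().rename("intensity")

# Going back to wide format re-introduces the missing values.
wide_again = long_df.unstack()
```

The long format is what the embedding-based model consumes, while the autoencoders work on the wide format with missing values.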

- [x] mask entries in larger dataset in long-format
- [x] mask peptides based on their frequency in samples (probability of being observed)
- [x] create *long-format* training data set without masked values for each model
    - FNN based on embeddings of peptides and samples (long-format **without** missing values)
    - Denoising AE (wide-format **with** missing values)
    - VAE (wide-format **with** missing values)
- [ ] restrict to only a training data split of consecutive data: Increase number of samples.
    - focus on best reconstruction performance
    - mean comparison

### Collaborative Filtering model

- Cannot accommodate the iid assumption of statistical tests in the current setup for embedding vectors.
  - if a pretrained model should be applied to a new batch of replicates (with a certain condition), one would need to find a way to initialize the sample embeddings without fine-tuning the model

In [None]:
from pprint import pprint
from src.nb_imports import *


import vaep.io_images
import seaborn

import numpy.testing as npt # fastcore.test functionality

from pathlib import Path
from src import metadata


import logging
from src.logging import setup_logger

logger = setup_logger(logger=logging.getLogger('vaep'))
logger.info("Experiment 02")

figures = {}  # collection of ax or figures

In [None]:
# None takes all
N_SAMPLES : int = 500
ADD_TENSORBOARD : bool = False
FN_PEPTIDE_INTENSITIES : Path = (config.FOLDER_DATA / 'df_intensities_N07285_M01000') # 90%
epochs_max = 5
#write to read only config ? namedtuple?

## Raw data

In [None]:
FN_PEPTIDE_INTENSITIES = Path(FN_PEPTIDE_INTENSITIES)

In [None]:
analysis = AnalyzePeptides(fname=FN_PEPTIDE_INTENSITIES, nrows=None)
analysis.df.columns.name = 'peptide'
analysis.log_transform(np.log2)
analysis

In [None]:
# some dates in the indices are not valid calendar dates
rename_indices_w_wrong_dates = {'20161131_LUMOS1_nLC13_AH_MNT_HeLa_long_03': '20161130_LUMOS1_nLC13_AH_MNT_HeLa_long_03',
                                '20180230_QE10_nLC0_MR_QC_MNT_Hela_12': '20180330_QE10_nLC0_MR_QC_MNT_Hela_12',
                                '20161131_LUMOS1_nLC13_AH_MNT_HeLa_long_01': '20161130_LUMOS1_nLC13_AH_MNT_HeLa_long_01',
                                '20180230_QE10_nLC0_MR_QC_MNT_Hela_11': '20180330_QE10_nLC0_MR_QC_MNT_Hela_11',
                                '20161131_LUMOS1_nLC13_AH_MNT_HeLa_long_02': '20161130_LUMOS1_nLC13_AH_MNT_HeLa_long_02'}
analysis.df.rename(index=rename_indices_w_wrong_dates, inplace=True)

### Select N consecutive samples

In [None]:
analysis.get_consecutive_dates(n_samples=N_SAMPLES)

In [None]:
assert not analysis.df._is_view

- biological stock differences in PCA plot. Show differences in models. Only see biological variance

In [None]:
analysis.add_metadata()

In [None]:
analysis.df_meta.describe()

Use this to find date-parsing errors; the result was used for the renaming above.

In [None]:
# invalid_dates = pd.to_datetime(analysis.df_meta.date, errors='coerce').isna()
# display(analysis.df_meta.loc[invalid_dates])
# {i : i for i in analysis.df_meta.loc[invalid_dates].index} # to rename

Find rare instrument types (potential labeling errors)

In [None]:
# N_MIN_INSTRUMENT = 10
# ms_instruments = analysis.df_meta.ms_instrument.value_counts()
# ms_instruments = ms_instruments[ms_instruments > N_MIN_INSTRUMENT].index
# mask = ~analysis.df_meta.ms_instrument.isin(ms_instruments)
# analysis.df_meta.loc[mask]

### PCA plot of raw data

In [None]:
fig = analysis.plot_pca()

In [None]:
vaep.io_images._savefig(fig, config.FIGUREFOLDER /
                        f'pca_plot_raw_data_{analysis.fname_stub}')

Scatter plots need to become interactive.

## Long format

- Data in long format: (peptide, sample_id, intensity)
- no missing values kept

In [None]:
analysis.df_long.head()

In [None]:
assert analysis.df_long.isna().sum().sum(
) == 0, "There are still missing values in the long format."

In [None]:
analysis.df_wide.head()

In [None]:
assert analysis.df_wide.isna().sum().sum(
) > 0, "There are no missing values left in the wide format"

### Sampling peptides by their frequency (important for later)

- higher count, higher probability to be sampled into training data
- missing peptides are sampled both into training as well as into validation dataset
- everything not in training data is validation data
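
The sampling scheme in the list above can be sketched on a small hypothetical data set as follows. The data, the `frac=0.5` value and the use of `value_counts` and `map` to build one weight per row are assumptions for illustration, not the notebook's exact code.

```python
import pandas as pd

# Hypothetical long-format data without missing values.
df_long = pd.DataFrame({
    "Sample ID": ["s1"] * 4 + ["s2"] * 3 + ["s3"] * 2,
    "peptide": ["pepA", "pepB", "pepC", "pepD",
                "pepA", "pepB", "pepC",
                "pepA", "pepB"],
    "intensity": [27.8, 29.1, 25.3, 30.2, 28.9, 27.6, 25.9, 28.7, 29.4],
})

# Number of samples in which each peptide was observed.
freq_per_peptide = df_long["peptide"].value_counts()

# One weight per row, aligned with the index of df_long.
weights = df_long["peptide"].map(freq_per_peptide)

# Draw within each sample; frequent peptides are more likely to be drawn.
df_train = df_long.groupby("Sample ID").sample(
    frac=0.5, weights=weights, random_state=42)

# Everything that was not drawn forms the validation split.
df_valid = df_long.drop(index=df_train.index)
```

pandas normalizes the weights within each group, so only the relative frequencies of the peptides in a sample matter.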

In [None]:
# freq_per_peptide = analysis.df.unstack().to_frame('intensity').reset_index(1, drop=True)
freq_per_peptide = analysis.df_long['intensity']
freq_per_peptide = freq_per_peptide.notna().groupby(level=0).sum()

In [None]:
# df_long = analysis.df.unstack().to_frame('intensity').reset_index(1)
analysis.df_train = analysis.df_long.groupby(
    by='Sample ID').sample(frac=0.95, weights=freq_per_peptide, random_state=42)
analysis.df_train = analysis.df_train.reset_index().set_index([
    'Sample ID', 'peptide'])
analysis.df_train

## MultiIndex 

- use MultiIndex for obtaining validation split
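
A minimal sketch of the idea, assuming a hypothetical four-row data set: with `(Sample ID, peptide)` as the index, the validation split is the set difference between the full index and the training index.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("s1", "pepA"), ("s1", "pepB"), ("s2", "pepA"), ("s2", "pepB")],
    names=["Sample ID", "peptide"],
)
df_long = pd.DataFrame({"intensity": [27.8, 29.1, 28.9, 27.6]}, index=index)

# Hypothetical training split: three of the four observations.
df_train = df_long.iloc[[0, 1, 3]]

# Index entries that are in the full data but not in the training split.
indices_valid = df_long.index.difference(df_train.index)
df_valid = df_long.loc[indices_valid]
```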

In [None]:
analysis._df_long = analysis.df_long.reset_index(
).set_index(['Sample ID', 'peptide'])
analysis.df_long.head()

In [None]:
analysis.indices_valid = analysis.df_long.index.difference(
    analysis.df_train.index)
analysis.df_valid = analysis.df_long.loc[analysis.indices_valid]

In [None]:
assert len(analysis.df_long) == len(analysis.df_train) + len(analysis.df_valid)

## Setup DL

In [None]:
import vaep.model as vaep_model
from vaep.cmd import get_args

BATCH_SIZE, EPOCHS = 128, 30
args = get_args(batch_size=BATCH_SIZE, epochs=EPOCHS,
                no_cuda=False)  # data transfer to GPU seems slow
kwargs = {'num_workers': 2, 'pin_memory': True} if args.cuda else {}

# torch.manual_seed(args.seed)
device = torch.device("cuda" if args.cuda else "cpu")
device

print(args, device, sep='\n')

Fastai default device for computation

In [None]:
import fastai.torch_core
# # device = torch.device('cpu')
# # fastai.torch_core.defaults.device = torch.device('cpu')
# device = fastai.torch_core.defaults.device
# device
torch.cuda.is_available()

In [None]:
fastai.torch_core.defaults

### Comparison data

- first impute the first and last row (using n=3 replicates)
- use pandas `interpolate` for the rows in between

In [None]:
from numpy import nan
test_data = {
    "pep1": {0: nan, 1: 27.8, 2: 28.9, 3: nan, 4: 28.7},
    "pep2": {0: 29.1, 1: nan, 2: 27.6, 3: 29.1, 4: nan},
}

test_data = pd.DataFrame(test_data)
mask = test_data.isna()

# floating point problem: numbers are not treated as decimals
expected = {
    (0, 'pep1'): (27.8 + 28.9) / 2,
    (1, "pep2"): (29.1 + 27.6) / 2,
    (3, "pep1"): (28.9 + 28.7) / 2,
    (4, "pep2"): (27.6 + 29.1) / 2,
}
            
def interpolate(wide_df: pd.DataFrame, name='replicates'):
    """Interpolate NA values with the values before and after.
    Uses n=3 replicates.
    For the first row, the replicates are the two following rows.
    For the last row, the replicates are the two preceding rows.
    """
    mask = wide_df.isna()
    # copy to avoid modifying the passed DataFrame through a view
    first_row = wide_df.iloc[0].copy()
    last_row = wide_df.iloc[-1].copy()

    m = first_row.isna()
    first_row.loc[m] = wide_df.iloc[1:3, m.to_list()].mean()

    m = last_row.isna()
    last_row.loc[m] = wide_df.iloc[-3:-1, m.to_list()].mean()

    ret = wide_df.interpolate(method='linear', limit_direction='forward', axis=0)
    ret.iloc[0] = first_row
    ret.iloc[-1] = last_row

    # keep only the values which were missing in the input
    ret = ret[mask].stack().dropna()
    ret.rename(name, inplace=True)
    return ret


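actual = interpolate(test_data).to_dict()
# compare with a tolerance instead of exact float equality
assert actual.keys() == expected.keys()
for key, value in expected.items():
    npt.assert_almost_equal(actual[key], value)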

In [None]:
analysis.median_train = analysis.df_train['intensity'].unstack().median()
analysis.median_train.name = 'train_median'
analysis.averag_train = analysis.df_train['intensity'].unstack().mean()
analysis.averag_train.name = 'train_average'

df_pred = analysis.df_valid.copy()

df_pred = df_pred.join(analysis.median_train, on='peptide')
df_pred = df_pred.join(analysis.averag_train, on='peptide')


_ = interpolate(wide_df = analysis.df_train['intensity'].unstack())
df_pred = df_pred.join(_)

df_pred

## Collaborative filtering model

In [None]:
from fastai.collab import CollabDataLoaders, MSELossFlat, Learner

analysis.collab = Analysis()
collab = analysis.collab
collab.columns = 'peptide,Sample ID,intensity'.split(',')

Create data view for collaborative filtering

- currently a bit hacky as the splitter does not support predefined indices (create custom subclass providing splits to internal methods?)

- Use the [`CollabDataLoaders`](https://docs.fast.ai/collab.html#CollabDataLoaders)  similar to the [`TabularDataLoaders`](https://docs.fast.ai/tabular.data.html#TabularDataLoaders).
- Use the [`IndexSplitter`](https://docs.fast.ai/data.transforms.html#IndexSplitter) and provide splits to whatever is used in `CollabDataLoaders`


In [None]:
collab.df_train = analysis.df_train.reset_index()
collab.df_valid = analysis.df_valid.reset_index()
collab.df_train.head()

In [None]:
collab.df_valid.head()

In [None]:
assert (collab.df_train.intensity.isna().sum(),
        collab.df_valid.intensity.isna().sum()) == (0, 0), "Remove missing values."

Hacky part: the training `DataLoader` of each `CollabDataLoaders` object is reused to build a custom `DataLoaders` instance.

In [None]:
collab.dl_train = CollabDataLoaders.from_df(
    collab.df_train, valid_pct=0.0, user_name='Sample ID', item_name='peptide', rating_name='intensity', bs=args.batch_size, device=device)
collab.dl_valid = CollabDataLoaders.from_df(
    collab.df_valid, valid_pct=0.0, user_name='Sample ID', item_name='peptide', rating_name='intensity', bs=args.batch_size,
    shuffle=False, device=device)
collab.dl_train.show_batch()

In [None]:
from fastai.data.core import DataLoaders
collab.dls = DataLoaders(collab.dl_train.train, collab.dl_valid.train)
if args.cuda:
    collab.dls.cuda()

In [None]:
collab.dl_valid.show_batch()

In [None]:
len(collab.dls.classes['Sample ID']), len(collab.dls.classes['peptide'])

In [None]:
len(collab.dls.train), len(collab.dls.valid)  # mini-batches

As an alternative to the hacky version, one could use a factory method, but then the sampling/splitting methods would need to be implemented (somehow without using [`RandomSplitter`](https://docs.fast.ai/data.transforms.html#RandomSplitter)).

 - [`TabDataLoader`](https://docs.fast.ai/tabular.core.html#TabDataLoader)
 - uses [`TabularPandas`](https://docs.fast.ai/tabular.core.html#TabularPandas)
 
 > Current problem: No custom splitter can be provided

In [None]:
# valid_idx = [analysis.df_long.index.get_loc(key=key) for key in analysis.indices_valid]
# splitter = IndexSplitter([valid_idx])
# splitter(collab.df.index)

# # replace in CollabDataloaders for getting splits
# # or directly in TabularCollab
# CollabDataLoaders??

In [None]:
# # drop NAs before?
#
# from fastai.tabular.all import *
# from fastai.tabular.data import TabularDataLoaders
# collab.dls = TabularDataLoaders.from_df(
#     df=analysis.df_long.reset_index(),
#     procs=[Categorify],
#     valid_idx=valid_idx,
#     cat_names=['Sample ID', 'peptide'],
#     y_names=['intensity'],
#     with_cont=False,
#     y_block=TransformBlock(),
#     bs=64)
# collab.dls.show_batch()
# # Problem: this return a second empty df - > would need to adapt model.

A brief check that the values match roughly

In [None]:
# from numpy.testing import assert_almost_equal
# UPTODECIMAL = 5
# assert_almost_equal(
#     collab.dls.valid_ds['intensity'].values,
#     analysis.df_long.iloc[valid_idx]['intensity'],
#     decimal=UPTODECIMAL
# )
# print(f"Values match up to the {UPTODECIMAL} decimal.")

### Model

In [None]:
collab.model_args = {}
collab.model_args['n_samples'] = len(collab.dls.classes['Sample ID'])
collab.model_args['n_peptides'] = len(collab.dls.classes['peptide'])
collab.model_args['dim_latent_factors'] = 20
collab.model_args['y_range'] = (
    int(analysis.df_train['intensity'].min()), int(analysis.df_train['intensity'].max())+1)

print("Args:")
pprint(collab.model_args)


model = vaep_model.DotProductBias(**collab.model_args)
learn = Learner(dls=collab.dls, model=model, loss_func=MSELossFlat())
if args.cuda:
    learn.cuda()
learn.summary()

### Training

In [None]:
learn.fit_one_cycle(epochs_max, 5e-3)

### Evaluation

In [None]:
collab.dls.valid_ds.items

In [None]:
# import pandas as pd
# dtype = pd.CategoricalDtype(collab.dls.classes['peptide'], ordered=False)
# pd.Categorical.from_codes(codes=collab.dls.valid_ds.items['Sample ID'], dtype=dtype)

In [None]:
# show False does not return results..
res = learn.show_results(show=True)  # something similar with return

In [None]:
df_pred = df_pred.reset_index()
pred, target = learn.get_preds()
df_pred['intensity_pred_collab'] = pd.Series(
    pred.flatten().numpy(), index=collab.dls.valid.items.index)

npt.assert_almost_equal(
    actual=collab.dls.valid.items.intensity.to_numpy(),
    desired=target.numpy().flatten()
)


def cast_object_to_category(df):
    """Object to category dtype."""
    _columns = df.select_dtypes(include='object').columns
    return df.astype({col: 'category' for col in _columns})


df_pred = cast_object_to_category(df_pred)
df_pred.set_index(['Sample ID', 'peptide'], inplace=True)
df_pred

In [None]:
# # Adapt to get prediction Dataframe
# encodings, pred, target = learn.get_preds(
#     with_input=True)  # per default validation data
# pred_df = pd.DataFrame([{'Sample ID': collab.dls.classes['Sample ID'][obs[0]], 'peptide': collab.dls.classes['peptide']
#                          [obs[1]], 'intensity_pred_collab': pred_intensity.item(), 'intensity': orig_intensity.item() } for obs, pred_intensity, orig_intensity in zip(encodings, pred, target)])
# # pred_df = pred_df.pivot(index='Sample ID', columns='peptide')
# pred_df.sort_values(by=['Sample ID', 'peptide'])

In [None]:
# MAE on the validation data; flatten both to avoid accidental broadcasting
(target.flatten() - pred.flatten()).abs().mean()

## Denoising Autoencoder (DAE)

### Custom Transforms

- [x] Shift standard normalized data around
    - Error metrics won't be directly comparable afterwards
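
A minimal NumPy sketch of such a shifted standardization and its inverse. The function names `encode` and `decode` are hypothetical helpers, and the defaults `shift_mu=0.5` and `scale_var=2` mirror the values used in this notebook.

```python
import numpy as np


def encode(x, mean, std, shift_mu=0.5, scale_var=2.0):
    """Standardize, then shrink by `scale_var` and shift by `shift_mu`."""
    return (x - mean) / std / scale_var + shift_mu


def decode(z, mean, std, shift_mu=0.5, scale_var=2.0):
    """Invert `encode` to get back to the original scale."""
    return (z - shift_mu) * scale_var * std + mean


x = np.array([25.0, 27.0, 29.0, 31.0])
mean, std = x.mean(), x.std()

z = encode(x, mean, std)        # mean 0.5, standard deviation 0.5
x_back = decode(z, mean, std)   # round trip to the original values
```

An error of size `e` in the encoded space corresponds to `e * scale_var * std` on the original scale, which is why metrics computed in the encoded space cannot be compared across different normalizations.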

In [None]:
from fastai.tabular.all import *

In [None]:
from fastcore.basics import store_attr
from fastcore.imports import noop
from fastai.data.transforms import Normalize, broadcast_vec


class NormalizeShiftedMean(Normalize):
    "Normalize/denorm `Tabular` data with shifted mean and scaled variance."

    def __init__(self, mean=None, std=None, axes=(0, 2, 3),
                 shift_mu=0.5, scale_var=2): store_attr()

    def setups(self, to: Tabular):
        store_attr(but='to', means=dict(getattr(to, 'train', to).conts.mean()),
                   stds=dict(getattr(to, 'train', to).conts.std(ddof=0)+1e-7))
        # NOTE: hard-coded here, overriding any values passed to __init__
        self.shift_mu = 0.5
        self.scale_var = 2
        return self(to)

    def encodes(self, to: Tabular):
        to.conts = (to.conts-self.means) / self.stds
        to.conts = to.conts / self.scale_var + self.shift_mu
        return to

    def decodes(self, to: Tabular):
        to.conts = (to.conts - self.shift_mu) * self.scale_var
        to.conts = (to.conts*self.stds) + self.means
        return to

    _docs = dict(encodes="Normalize batch with shifted mean and scaled variance",
                 decodes="Denormalize batch with shifted mean and scaled variance")


# test with Tabular data somehow
test_data_view = analysis.df.iloc[:100, :10]

# norm_shifted = Normalize.from_stats(mean=test_data_view.mean(), std=test_data_view.std())
# norm_shifted = NormalizeShiftedMean.from_stats(mean=test_data_view.mean(), std=test_data_view.std())

procs = [NormalizeShiftedMean, FillMissing(add_col=True)]
cont_names = list(test_data_view.columns)  # a set would not preserve the column order
to = TabularPandas(test_data_view, procs=procs, cat_names=None,
                   cont_names=cont_names, y_names=None, splits=None, y_block=None, do_setup=True)
to.items

fastai uses type dispatch (similar to `singledispatch`) internally to modify the normalization depending on the type annotations!

  - see the `Transform` [docs](https://fastcore.fast.ai/transform.html#Transform) (`Transform` itself is part of the fastcore library)
  - see the `Transform`'s metaclass [`__call__` function](https://github.com/fastai/fastcore/blob/ae8148c85a0c57cc7fd6aa29fa13bdbfbe59be22/fastcore/transform.py#L33-L39) for the "singledispatch"/"typedispatch" functionality shown below
  - a `Normalize` typedispatch is added for each application, [here `Tabular`](https://github.com/fastai/fastai/blob/99d38fec7207db9b4209568bebc85ded7e3d3f1b/fastai/tabular/core.py#L269-L283)
  - check out [`TabularProc`](https://docs.fast.ai/tabular.core.html#TabularProc) and [`InplaceTransform`](https://fastcore.fast.ai/transform.html#InplaceTransform)
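
To illustrate what dispatching on a type annotation means, here is a minimal sketch using `functools.singledispatch` from the Python standard library. It is only an analogy: fastcore ships its own dispatch mechanism, and the `Table` and `Image` classes below are hypothetical stand-ins.

```python
from functools import singledispatch


class Table:
    """Hypothetical stand-in for a tabular data container."""


class Image:
    """Hypothetical stand-in for an image container."""


@singledispatch
def normalize(obj):
    # Fallback when no implementation is registered for the type.
    return "no-op"


@normalize.register
def _(obj: Table):
    # Selected because the argument is annotated as `Table`.
    return "column-wise normalization"


@normalize.register
def _(obj: Image):
    # Selected because the argument is annotated as `Image`.
    return "channel-wise normalization"
```

Calling `normalize(Table())` picks the `Table` implementation, while an unregistered type such as `int` falls back to the default.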

> Check what happens if one removes `to:Tabular`

In [None]:
# #This would replace the Normalization for Tabular objects
# @Normalize
# def setups(self, to:Tabular):
#     store_attr(but='to', means=dict(getattr(to, 'train', to).conts.mean()),
#                stds=dict(getattr(to, 'train', to).conts.std(ddof=0)+1e-7))
#     self.shift_mu = 0.5
#     self.scale_var = 2
#     return self(to)

# @Normalize
# def encodes(self, to:Tabular):
#     to.conts = (to.conts-self.means) / self.stds
#     to.conts =  to.conts / self.scale_var + self.shift_mu
#     return to

# @Normalize
# def decodes(self, to:Tabular):
#     to.conts = (to.conts - self.shift_mu) * self.scale_var
#     to.conts = (to.conts*self.stds ) + self.means
#     return to

# # test with Tabular data somehow
# test_data_view = analysis.df.iloc[:100, :10]

# # norm_shifted = Normalize.from_stats(mean=test_data_view.mean(), std=test_data_view.std())
# # norm_shifted = NormalizeShiftedMean.from_stats(mean=test_data_view.mean(), std=test_data_view.std())

# procs = [Normalize, FillMissing(add_col=True)]
# cont_names = list(set(test_data_view))
# to = TabularPandas(test_data_view, procs=procs, cat_names=None,
#                    cont_names=cont_names, y_names=None, splits=None, y_block=None, do_setup=True)
# to.items

### DataLoaders

In [None]:
from fastai.callback.core import Callback

from fastai.data.core import DataLoaders
# from fastai.data.transforms import Normalize

# from fastai.learner import *
from fastai.learner import Learner
from fastai.losses import MSELossFlat


# https://docs.fast.ai/tabular.core.html#FillStrategy
# from fastai.tabular.core import FillMissing
# from fastai.tabular.core import TabularPandas

In [None]:
# revert format
analysis.df_train = analysis.df_train['intensity'].unstack() #undo using `stack`
analysis.df_valid = analysis.df_valid['intensity'].unstack()

Mean and std. dev. from training data

In [None]:
# norm = Normalize.from_stats(analysis.df_train.mean(), analysis.df_valid.std()) # copy interface?
NORMALIZER = Normalize # NormalizeShiftedMean

#### Training data

procs passed to `TabularPandas` are handled internally
  1. not necessarily in order
  2. with a setup call (using the current training data)

In [None]:
procs = [NORMALIZER, FillMissing(add_col=True)]
cont_names = list(analysis.df_train.columns)

to = TabularPandas(analysis.df_train, procs=procs, cont_names=cont_names)
print("Tabular object:", type(to))

to.items # items reveals data in DataFrame

Better to manually apply `Transform`s on the `Tabular` type

In [None]:
cont_names = list(analysis.df_train.columns)
to = TabularPandas(analysis.df_train, cont_names=cont_names, do_setup=False)


tf_norm = NORMALIZER()
_ = tf_norm.setups(to) # returns to
tf_fillna = FillMissing(add_col=True)
_ = tf_fillna.setup(to)

print("Tabular object:", type(to))
# _ = (procs[0]).encodes(to)
to.items # items reveals data in DataFrame

Check mean and standard deviation after normalization

In [None]:
to.items.iloc[:, :10].describe()  # not perfect anymore as expected

Mask is added as type bool

In [None]:
to.items.dtypes.value_counts()

with the suffix `_na` where `True` is indicating a missing value replaced by the `FillMissing` transformation

In [None]:
to.cont_names, to.cat_names

In [None]:
assert len(to.valid) == 0

#### Validation data

- reuse training data with different mask for evaluation
- target data is the validation data
    - switch between training and evaluation mode for setting comparison

In [None]:
_df_valid = TabularPandas(analysis.df_valid,cont_names=analysis.df_valid.columns.tolist())
# assert analysis.df_valid.isna().equals(y_valid.items.isna())
_df_valid = tf_norm.encodes(_df_valid)

In [None]:
_df_valid.items.iloc[:,:10].describe()

In [None]:
# Validation dataset
# build validation DataFrame with mask according to validation data
# FillNA values in data as before, but do not add categorical columns (as this is done manually)
_valid_df = to.conts  # same data for predictions
_valid_df = _valid_df.join(analysis.df_valid.isna(), rsuffix='_na')  # mask
_valid_df = _valid_df.join(_df_valid.items, rsuffix='_val')  # target
_valid_df

In [None]:
from fastai.tabular.core import TabularPandas
procs = None #[norm, FillMissing(add_col=False)]  # mask is provided explicitly

cont_names = list(analysis.df_train.columns)
cat_names = [f'{s}_na' for s in cont_names]
y_names = [f'{s}_val' for s in cont_names]

splits = None
y_block = None
to_valid = TabularPandas(_valid_df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                         y_names=y_names, splits=splits, y_block=y_block, do_setup=True)
to_valid.items

In [None]:
stats_valid = to_valid.targ.iloc[:, :100].describe()
stats_valid

In [None]:
to_valid.cats # True = training data ("fill_na" transform sets mask to true in training data where values are replaced)

In [None]:
assert list(to_valid.cat_names) == list(
    _valid_df.select_dtypes(include='bool').columns)
assert to_valid.cats.equals(analysis.df_valid.isna().add_suffix('_na'))

### PyTorch DataLoader

- [ ] can they be plugged in efficiently as suggested by the fastai paper?

In [None]:
# from vaep.transform import ShiftedStandardScaler

# args_ae = {}
# args_ae['SCALER'] = StandardScaler
# args_ae['SCALER'] = ShiftedStandardScaler

# # select initial data: transformed vs not log transformed
# scaler = args_ae['SCALER'](scale_var=2).fit(analysis.df_train)
# # five examples from validation dataset
# scaler.transform(analysis.df_train).describe(percentiles=[0.025, 0.975])

In [None]:
# from torchvision import transforms
# from torch.utils.data import DataLoader
# from vaep.io.datasets import PeptideDatasetInMemoryMasked

# # ToDo: replace with helper class (see below)
# tf_norm = None  # replace with Normalizer

# dataset_train = PeptideDatasetInMemoryMasked(
#     data=scaler.transform(analysis.df_train.values))
# dataset_valid = PeptideDatasetInMemoryMasked(
#     data=scaler.transform(analysis.df_valid.values))
# dl_train = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True)
# dl_valid = DataLoader(dataset_valid, batch_size=args.batch_size, shuffle=False)

### Mix and match dataloaders

- the train dataloader of both `TabularPandas` objects is used
- in both cases, it is the train dataloader of the respective dataloaders

In [None]:
args.batch_size
dl_train = to.dataloaders(shuffle_train=True, shuffle=False,
                          bs=args.batch_size).train  # , after_batch=after_batch)
dl_valid = to_valid.dataloaders(
    shuffle_train=False, shuffle=False, bs=args.batch_size).train

In [None]:
dls = DataLoaders(dl_train, dl_valid)
b = dls.train.one_batch()
[x.shape for x in b]  # cat, cont, target

In [None]:
dls = DataLoaders(dl_train, dl_valid)
b = dls.valid.one_batch()
[x.shape for x in b]  # cat, cont, target

### Model

- standard PyTorch Model from before

In [None]:
M = analysis.df_train.shape[-1]
model = vaep_model.Autoencoder(n_features=M, n_neurons=int(
    M/2), last_decoder_activation=None, dim_latent=30)

### Callbacks

- control training loop
    - set what is data
    - what should be used for evaluation (differs for training and evaluation mode)
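
The core idea, restricting both predictions and targets to the observed entries before computing the loss, can be sketched in plain NumPy. The batch below is hypothetical and the sketch is independent of the fastai callback API.

```python
import numpy as np

# Hypothetical batch: 2 samples x 3 peptides; NaN marks a missing value.
target = np.array([[27.8, np.nan, 28.7],
                   [29.1, 27.6, np.nan]])
pred = np.array([[27.0, 26.0, 29.0],
                 [29.5, 27.0, 28.0]])

# True where an intensity was actually measured.
mask = ~np.isnan(target)

# Boolean indexing flattens both arrays to the observed entries only.
pred_observed = pred[mask]
target_observed = target[mask]

mse = np.mean((pred_observed - target_observed) ** 2)
```

Predictions at positions without a measured value do not contribute to the loss, since there is nothing to compare them to.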

In [None]:
class ModelAdapter(Callback):
    """The model's forward method expects only one input matrix.
    Apply mask from dataloader to both pred and targets."""

    def __init__(self, p=0.1):
        self.do = torch.nn.Dropout(p=p)  # for denoising AE

    def before_batch(self):
        """Remove cont. values from batch (mask)"""
        # assert self.yb
        mask, data = self.xb  # x_cat, x_cont
        self.learn._mask = mask != 1
        if self.training:
            self.learn.yb = (data[self.learn._mask],)
        # apply dropout to the input data (denoising AE)
        self.learn.xb = (self.do(data),)
        #         # could be function handled by switch set at beginning of evaluation/training?
        #         if self.training:
        #             self.learn.yb = (data[self.learn._mask],)
        #         else:
        #             # self.y is not available at "before_batch" - > why?
        #             self.learn.yb = (self.y[mask],)

    def after_pred(self):
        M = self._mask.shape[-1]

        if not self.training:
            if len(self.yb):
                self.learn.yb = (self.y[self.learn._mask],)
                self.val_targets.append(self.learn.yb[0])
                

        #         self.learn.pred = self.pred.view(-1, M)[self._mask] #is this flat?
        self.learn.pred = self.pred[self._mask] #is this flat?
        if not self.training:
            self.val_preds.append(self.learn.pred)
        
    def before_validate(self):
        "containers for current predictions"
        self.learn.val_preds,self.learn.val_targets = [],[]


### Learner: Fastai Training Loop

In [None]:
learn = Learner(dls=dls, model=model,
                loss_func=MSELossFlat(), cbs=ModelAdapter())

In [None]:
learn.show_training_loop()

In [None]:
learn.summary()

In [None]:
suggested_lr = learn.lr_find()
suggested_lr

#### Train one batch

In [None]:
b = learn.dls.one_batch()
learn.one_batch(0, b)

In [None]:
b

#### Train epochs

In [None]:
learn.fit_one_cycle(epochs_max, lr_max=suggested_lr.valley)

In [None]:
learn.val_preds, learn.val_targets #

In [None]:
learn.recorder.plot_loss()

In [None]:
# L(zip(learn.recorder.iters, learn.recorder.values))


### Evaluation

In [None]:
pred, target = learn.get_preds(act=noop, concat_dim=0, reorder=False) # reorder True: Only 500 predictions returned
len(pred), len(target)  

The MSE on transformed data is not very informative for a comparison between models if these use different standardizations.
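
A small numerical sketch of why this is the case, using simulated values (all numbers are hypothetical): the MSE of the same predictions differs by a factor of the squared standard deviation between the original and the standardized scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intensities and predictions on the original (log2) scale.
y_true = rng.normal(loc=28.0, scale=2.0, size=1000)
y_pred = y_true + rng.normal(scale=0.5, size=1000)

mean, std = y_true.mean(), y_true.std()


def mse(a, b):
    """Mean squared error between two arrays."""
    return np.mean((a - b) ** 2)


# The same predictions, once in original and once in standardized units.
mse_original = mse(y_true, y_pred)
mse_standardized = mse((y_true - mean) / std, (y_pred - mean) / std)

# The two errors differ by a factor of std**2.
ratio = mse_original / mse_standardized
```

For a fair comparison, predictions are therefore transformed back to the original scale before computing metrics.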

In [None]:
learn.loss_func(pred, target) # MSE in transformed space not too interesting

In [None]:
# check target is in expected order
Y = dls.valid.targ

npt.assert_almost_equal(
    actual=target.numpy(),
    desired=Y.stack().to_numpy()
)

In [None]:
def transform_preds(pred:torch.Tensor, index:pd.Index):
    pred = pd.Series(pred, index).unstack()
    pred = TabularPandas(pred, cont_names=list(pred.columns))
    _ = tf_norm.decode(pred)
    pred = pred.items.stack()
    return pred

df_pred['intensity_pred_dae'] = transform_preds(pred=pred, index=analysis.df_valid.stack().index)
df_pred

## Variational Autoencoder (VAE)

### Transforms - FastAi MinMaxScaler

In [None]:
# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# X_scaled = X_std * (max - min) + min

### Scikit Learn MinMaxScaler

- [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
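
The min-max formula noted in the comment above can be sketched in NumPy as follows. The helper names are hypothetical, and ignoring missing values when computing the minimum and maximum is an assumption made here so that missing entries stay missing after scaling.

```python
import numpy as np


def min_max_fit(X):
    """Column-wise minimum and maximum, ignoring NaN."""
    return np.nanmin(X, axis=0), np.nanmax(X, axis=0)


def min_max_transform(X, x_min, x_max, feature_range=(0.0, 1.0)):
    """Scale each column linearly into `feature_range`."""
    lo, hi = feature_range
    X_std = (X - x_min) / (x_max - x_min)
    return X_std * (hi - lo) + lo


def min_max_inverse(X_scaled, x_min, x_max, feature_range=(0.0, 1.0)):
    """Map scaled values back to the original scale."""
    lo, hi = feature_range
    return (X_scaled - lo) / (hi - lo) * (x_max - x_min) + x_min


X = np.array([[25.0, 30.0],
              [27.0, np.nan],
              [29.0, 34.0]])
x_min, x_max = min_max_fit(X)
X_scaled = min_max_transform(X, x_min, x_max)
X_back = min_max_inverse(X_scaled, x_min, x_max)
```

The minimum and maximum are taken from the training data only, so scaled validation values can fall slightly outside the target range.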

In [None]:
from vaep.transform import MinMaxScaler

args_vae = {}
args_vae['SCALER'] = MinMaxScaler
# select initial data: transformed vs not log transformed
scaler = args_vae['SCALER']().fit(analysis.df_train)
scaler.transform(analysis.df_valid.iloc[:5])

### DataLoaders

- follow instructions for using plain PyTorch Datasets, see [tutorial](https://docs.fast.ai/tutorial.siamese.html#Preparing-the-data)


In [None]:
assert all(analysis.df_train.columns == analysis.df_valid.columns)
if not all(analysis.df.columns == analysis.df_train.columns):
    print("analysis.df columns are not the same as analysis.df_train")
    # ToDo: DataLoading has to be cleaned up
    # analysis.df = analysis.df_train.fillna(analysis.df_valid)

In [None]:
from vaep.io.datasets import PeptideDatasetInMemory

FILL_NA = 0.0

train_ds = PeptideDatasetInMemory(data=scaler.transform(analysis.df_train).to_numpy(dtype=None), fill_na=FILL_NA)
valid_ds = PeptideDatasetInMemory(data=scaler.transform(analysis.df_train.fillna(analysis.df_valid)).to_numpy(dtype=None),
                                  mask=analysis.df_valid.notna().to_numpy(), fill_na=FILL_NA)

assert (train_ds.peptides == valid_ds.peptides).all()

In [None]:
# import importlib; importlib.reload(vaep.io.datasets)
# from vaep.io.datasets import PeptideDatasetInMemoryMasked
# from vaep.io.dataloaders import DataLoadersCreator

# # ToDo: Change interface to np.array
# data_loader_creator = DataLoadersCreator(
#     df_train=analysis.df_train.to_numpy(),
#     df_valid=analysis.df_valid.to_numpy(),
#     scaler=scaler,
#     DataSetClass=PeptideDatasetInMemoryMasked,
#     batch_size=args.batch_size)

# dl_train, dl_valid = data_loader_creator.get_dls(shuffle_train=True)

# logger.info(
#     "N train: {:5,d} \nN valid: {:5,d}".format(
#         len(dl_train.dataset), len(dl_valid.dataset))
# )

In [None]:
# # Could be changed in DataSet Class
#train_ds.encodes, valid_ds.encodes = train_ds.__getitem__, valid_ds.__getitem__

In [None]:
# train_ds, valid_ds = data_loader_creator.data_train, data_loader_creator.data_valid

In [None]:
dls = DataLoaders.from_dsets(train_ds, valid_ds, n_inp=2)

### Model

In [None]:
from torch.nn import Sigmoid

M = analysis.df_train.shape[-1]
model = vaep_model.VAE(n_features=M, n_neurons=int(
    M/2), last_encoder_activation=None, last_decoder_activation=Sigmoid, dim_latent=30)

### Training

In [None]:
class ModelAdapterVAE(Callback):
    """The model's forward method expects only one input matrix.
    Apply mask from dataloader to both pred and targets."""


    def before_batch(self):
        """Remove cont. values from batch (mask)"""
        # assert self.yb
        data, mask = self.xb
        self.learn._mask = mask # mask is True for non-missing, measured values
        # no dropout here: pass only the data on to the model
        self.learn.xb = (data,)


    def after_pred(self):
        M = self._mask.shape[-1]
#         self.learn.yb = (self.y[self.learn._mask],)
#         if not self.training:
        if len(self.yb):
#             breakpoint()
            self.learn.yb = (self.y[self.learn._mask],)
# #                 self.val_targets.append(self.learn.yb[0])
                

        pred, mu, logvar = self.pred # return predictions
        self.learn.pred = (pred[self._mask], mu, logvar) #is this flat?
#         if not self.training:
#             self.val_preds.append(self.learn.pred)

        

In [None]:
# https://docs.fast.ai/losses.html#CrossEntropyLossFlat
# from fastai.losses import CrossEntropyLossFlat
from torch.nn import functional as F


#self.loss_func(self.pred, *self.yb)
def loss_fct_vae(pred, y):
    recon_batch, mu, logvar = pred
    batch = y
    res = vaep_model.loss_function(recon_batch=recon_batch, batch=batch, mu=mu, logvar=logvar, reconstruction_loss=F.binary_cross_entropy)
    return res['loss']

learn = Learner(dls=dls,
                model=model,
                loss_func=loss_fct_vae,
                cbs=ModelAdapterVAE())

learn.show_training_loop(); learn.summary()

In [None]:
suggested_lr = learn.lr_find()
suggested_lr

#### Train epochs

In [None]:
learn.fit_one_cycle(epochs_max, lr_max=suggested_lr.valley)

### Evaluation

In [None]:
pred, target = learn.get_preds(act=noop, concat_dim=0, reorder=False) # reorder True: Only 500 predictions returned
len(pred), len(target)  

In [None]:
len(pred[0])

In [None]:
learn.loss_func(pred, target)

In [None]:
_pred = pd.Series(pred[0], index=analysis.df_valid.stack().index).unstack()
_pred = scaler.inverse_transform(_pred).stack()

df_pred['intensity_pred_vae'] = _pred
df_pred

## Compare the 3 models

- replicates: replace NAs with neighbouring ("close") values
- train average, median: Replace NA with average or median from training data

In [None]:
import sklearn.metrics as sklm
pred_columns = df_pred.columns[1:]
scoring =     [  ('MSE',sklm.mean_squared_error),
                 ('MAE',sklm.mean_absolute_error)]

y_true = df_pred['intensity']

metrics = {}
for col in pred_columns:
    metrics[col] = dict(
        [(k, f(y_true=y_true, y_pred=df_pred[col])) for k, f in scoring]
    )

metrics = pd.DataFrame(metrics)
metrics.to_csv(config.FOLDER_DATA / 'exp_02_metrics.csv', float_format='{:.3f}'.format, sep=';')
metrics.sort_values(by=[k for k, f in scoring],axis=1)

Save final prediction values of validation data for later comparison.

In [None]:
# use the file name stub (not the folder path) as prefix of the file name
df_pred.to_csv(config.FOLDER_DATA / f"{analysis.fname_stub}_valid_pred.csv")

## PCA plot for imputed and denoised data

two setups:
 - impute missing values
 - additionally change observed values

In [None]:
# temporary small check

metrics_expected = {'train_median': {'MSE': 1.5363473381307673, 'MAE': 0.8247024248330177},
 'train_average': {'MSE': 1.4990057373378602, 'MAE': 0.8401609089063766},
 'replicates': {'MSE': 1.1866650348310341, 'MAE': 0.7219400315371745},
 'intensity_pred_collab': {'MSE': 0.3324973869128017,
  'MAE': 0.3102820498107155},
 'intensity_pred_dae': {'MSE': 0.44876634948273664, 'MAE': 0.4020490145341609},
 'intensity_pred_vae': {'MSE': 0.6307470256264553, 'MAE': 0.5060164227862353}}
metrics_expected = pd.DataFrame(metrics_expected)
assert metrics_expected.equals(metrics), metrics_expected.compare(metrics)