# `ABaCo` tutorial: Anaerobic Digestion benchmark dataset

In this notebook we apply ABaCo to correct batch effects in an Anaerobic Digestion (AD) dataset whose samples were treated with different phenol concentrations.

-----
**Data Description:**
- The AD dataset was generated by [Chapleur et al. (2016)](https://doi.org/10.1007/s10532-015-9751-4) and shared in the R package [PLSDAbatch](https://github.com/EvaYiwenWang/PLSDAbatch) by [Wang and Cao (2023)](https://doi.org/10.1093/bib/bbac622). It comprises 75 samples with 567 identified taxonomic groups.
- The samples were treated with two different phenol concentrations, which is the biological source of variation.
- The samples were processed on 5 different dates over a span of 2 years, which is the technical source of variation (i.e., the batch effect).

**Goal:**

With ABaCo the aim is to remove the technical variation (batch effect) while retaining the biological variation (phenol group effect). 

-----

## Dataset Requirements for ABaCo

The dataset should contain the following to be compatible with the ABaCo framework:

| sample | batch      | trt  | feat1 | feat2 | ... |
|--------|------------|------|-------|-------|-----|
| A      | 24/07/2025 | low  | #     | #     | ... |
| B      | 15/06/2024 | low  | #     | #     | ... |
| C      | 24/07/2025 | high | #     | #     | ... |

- The data must have 3 categorical columns:
    1. unique IDs that identify the observations/samples, e.g., sample IDs (`sample`)
    2. the batch/factor grouping to be corrected by ABaCo, e.g., dates of sample analysis (`batch`)
    3. the biological/experimental factor whose variation ABaCo should retain when correcting the batch effect, e.g., phenol concentration condition (`trt`)

- And the numeric feature columns to train on.
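
As a minimal sketch of this layout, the toy table below is assembled by hand with pandas. The values are made up for illustration, and the explicit cast to the `category` dtype is an assumption about what a compatible DataFrame looks like, based on the `dtypes: category(3), int64(567)` summary printed for the AD data further down.

```python
import pandas as pd

# Toy table in the layout described above; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "sample": ["A", "B", "C"],
        "batch": ["24/07/2025", "15/06/2024", "24/07/2025"],
        "trt": ["low", "low", "high"],
        "feat1": [12, 0, 7],
        "feat2": [3, 5, 0],
    }
)

# Cast the three identifier columns to pandas' categorical dtype.
factor_cols = ["sample", "batch", "trt"]
toy = toy.astype({col: "category" for col in factor_cols})

# Every remaining column is treated as a numeric feature.
feature_cols = [col for col in toy.columns if col not in factor_cols]

print(toy.dtypes)
print(feature_cols)
```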

`abaco` provides a `BatchEffectDataLoader.DataPreprocess()` function to help convert a plain text file (e.g., csv, tsv) into a compatible `pd.DataFrame`, and a `one_hot_encoding()` function to encode categorical columns.

In [1]:
from abaco.BatchEffectDataLoader import DataPreprocess, one_hot_encoding

# Load AD count
path_to_dataset = "data/dataset_ad.csv"
batch_col = "batch"
id_col = "sample"
bio_col = "trt"

# Load the file into a pd.DataFrame that is compatible with the ABaCo framework
df_ad = DataPreprocess(
    path_to_dataset,
    factors=[
        batch_col, 
        id_col, 
        bio_col
    ]
)

# see that there are 3 categorical and n numeric columns
print(df_ad.info())

# Isolate the (numeric) features to be input into the model
df_features = df_ad.drop(columns=[batch_col, id_col, bio_col]).values
print(f"\n The feature inputs have shape: {df_features.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 570 entries, sample to Cluster_17710
dtypes: category(3), int64(567)
memory usage: 335.5 KB
None

 The feature inputs have shape: (75, 567)


In [2]:
# One-hot encode the categorical columns
# the batch factor
ohe_batches, ohe_batch_classes = one_hot_encoding(df_ad[batch_col])
# and the biological factor
ohe_bio, ohe_bio_classes = one_hot_encoding(df_ad[bio_col])

print(f"Batch classes: {ohe_batch_classes}\n")
print(f"Biological classes: {ohe_bio_classes}")

Batch classes: ['09/04/2015', '14/04/2016', '01/07/2016', '14/11/2016', '21/09/2017']

Biological classes: ['0-0.5', '1-2']
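
To make the encoding step concrete, here is a small sketch of one-hot encoding written with NumPy. It is not abaco's implementation: `one_hot_sketch` is a hypothetical helper, and the class ordering (first appearance) and return type chosen here are assumptions that may differ from what `one_hot_encoding()` does.

```python
import numpy as np


def one_hot_sketch(labels):
    """Return (one-hot matrix, class list); classes are ordered by first appearance."""
    classes = list(dict.fromkeys(labels))               # unique labels, original order kept
    index = {cls: i for i, cls in enumerate(classes)}   # label -> column index
    codes = np.array([index[label] for label in labels])
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(len(classes), dtype=np.float32)[codes], classes


labels = ["0-0.5", "1-2", "1-2", "0-0.5"]
encoded, classes = one_hot_sketch(labels)
print(classes)
print(encoded)
```

Each row has a single 1 in the column of its class, so the number of columns equals the number of classes. This is why `len(ohe_batch_classes)` and `len(ohe_bio_classes)` can later be passed as the number of batches and biological groups.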


### The AD dataset in brief
`abaco` also provides plotting functions to easily explore the data. Here we use `BatchEffectPlots.plotPCoA()` to visualize the batch and biological effects on the data.

In [8]:
from abaco.BatchEffectPlots import plotPCoA

# pcoa
plotPCoA(
    data=df_ad, 
    sample_label=id_col, 
    batch_label=batch_col, 
    experiment_label=bio_col,
)

- Batch effect (colours): 
    - Samples were processed on 5 different dates over a span of 2 years, which is the technical source of variation.
    - The clustering of points by colour shows that there is a batch effect.
- Biological effect (shapes): 
    - The samples were treated with 2 different phenol concentrations, which is the biological source of variation.
    - Within each batch, the circle points (lower phenol conc.) lie to the left of the square points (higher conc.).
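
For readers unfamiliar with PCoA, the sketch below shows the general technique with NumPy only: compute pairwise dissimilarities, double-centre the squared distances, and take the leading eigenvectors as coordinates. The Bray-Curtis dissimilarity is an assumption made for this illustration; the metric and implementation used inside `plotPCoA()` are not stated here and may differ.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between the rows of a non-negative matrix."""
    diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    total = (counts[:, None, :] + counts[None, :, :]).sum(axis=2)
    # Pairs whose combined abundance is zero get dissimilarity 0.
    return np.divide(diff, total, out=np.zeros_like(diff, dtype=float), where=total > 0)


def pcoa(dist, n_axes=2):
    """Classical PCoA: double-centre the squared distances, then eigendecompose."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]  # largest eigenvalues first
    # Scale eigenvectors by sqrt(eigenvalue); negative eigenvalues are clipped to 0.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))


# Random counts standing in for a samples x taxa table.
rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(6, 10))
coords = pcoa(bray_curtis(counts))
print(coords.shape)
```

Each row of `coords` is one sample placed on the first two principal coordinates, which is what a PCoA scatter plot displays.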


### The goal 

The aim of **ABaCo** is to: 
1) correct the batch effect (e.g., the points should no longer cluster by colour in the PCoA) while
2) maintaining biological variance (e.g., keep circles and squares separated).

Ideally, after using ABaCo to transform the data, the resulting PCoA will look like a colourful mixture of points, where the circle points remain to the left of the square ones.


---

## Data Preparation for PyTorch

**ABaCo** is implemented using the [PyTorch ecosystem](https://docs.pytorch.org/docs/stable/index.html). 

Following the typical [workflow](https://docs.pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html), we use PyTorch's [`torch.utils.data.DataLoader` class](https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to iterate over the dataset in mini-batches (not to be confused with the experimental batches being corrected). 

In [4]:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert the features into a tensor
features_tensor = torch.tensor(df_features, dtype=torch.float32)

# Construct DataLoader
dataloader = DataLoader(
    TensorDataset(
        features_tensor, # input features
        ohe_batches, # one hot encoded batch information
        ohe_bio, # one hot encoded biological information
    ),
    batch_size=100,  # number of samples per mini-batch; a single mini-batch here, since AD has only 75 samples
)


Now the data is prepped and ready for `ABaCo` !
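
As a quick sanity check before training, iterating over a loader once shows what each mini-batch contains. The sketch below uses random stand-in tensors with the same shapes as the AD data (75 samples, 567 features, 5 batches, 2 biological groups) rather than the real tensors, so it runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins with the AD shapes; the values are not the real data.
n_samples, n_features, n_batches, n_bios = 75, 567, 5, 2
x = torch.rand(n_samples, n_features)
batch_ohe = torch.nn.functional.one_hot(
    torch.randint(0, n_batches, (n_samples,)), num_classes=n_batches
).float()
bio_ohe = torch.nn.functional.one_hot(
    torch.randint(0, n_bios, (n_samples,)), num_classes=n_bios
).float()

loader = DataLoader(TensorDataset(x, batch_ohe, bio_ohe), batch_size=100)

# Each iteration yields one (features, batch one-hot, bio one-hot) triple.
shapes = [(xb.shape, bb.shape, tb.shape) for xb, bb, tb in loader]
print(len(shapes), shapes[0])
```

Because `batch_size` exceeds the number of samples, the loader yields a single mini-batch holding all 75 samples.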

------

# Using `ABaCo`

## Training the ABaCo model

To train ABaCo on the prepared AD dataset, we call `abaco.ABaCo.abaco_run()` with only the required parameters, as shown in the cell below, so that all optional parameters keep their default values.

In practice the optional parameters usually need to be set for the dataset at hand. They are described briefly in the function's documentation, e.g., via `help(abaco_run)`.

In [5]:
from abaco.ABaCo import abaco_run

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Run ABaCo
abaco_model = abaco_run(
    dataloader = dataloader,
    n_batches = len(ohe_batch_classes),  # number of batches
    n_bios = len(ohe_bio_classes),  # number of biological groups
    input_size = df_features.shape[1],  # number of features
    device = device, # use gpu if available else cpu
)

Pre-training: VAE for reconstructing data and batch mixing adversarial training: 100%|██████████| 2000/2000 [01:06<00:00, 29.86it/s, adv=-1.5661, contra=38.5550, disc=1.5724, elbo=538.3237, epoch=1999/2001]
Training: VAE decoder with masked batch labels: 100%|██████████| 2000/2000 [09:41<00:00,  3.44it/s, epoch=2000/2000, vae_loss=636.1075]


In [6]:
help(abaco_run)

Help on function abaco_run in module abaco.ABaCo:

abaco_run(dataloader: torch.utils.data.dataloader.DataLoader, n_batches: int, n_bios: int, device: torch.device, input_size: int, new_pre_train: bool = False, seed: int = 42, d_z: int = 16, prior: str = 'VMM', count: bool = True, pre_epochs: int = 2000, post_epochs: int = 2000, kl_cycle: bool = True, smooth_annealing: bool = False, encoder_net: list = [1024, 512, 256], decoder_net: list = [256, 512, 1024], vae_act_func=ReLU(), disc_net: list = [256, 128, 64], disc_act_func=ReLU(), disc_loss_type: str = 'CrossEntropy', w_elbo: float = 1.0, beta: float = 20.0, w_disc: float = 1.0, w_adv: float = 1.0, w_contra: float = 10.0, temp: float = 0.1, w_cycle: float = 0.1, vae_pre_lr: float = 0.001, vae_post_lr: float = 0.0001, disc_lr: float = 1e-05, adv_lr: float = 1e-05)
    Function to run the ABaCo model training.
    
    Parameters
    ----------
    dataloader: torch.utils.data.DataLoader
        Pytorch dataLoader for the training data.
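
The signature printed by `help(abaco_run)` lists the optional parameters and their defaults. A hedged sketch of overriding a few of them is given below. The parameter names come from that signature, but the values are arbitrary placeholders rather than tuned recommendations. Reading `pre_epochs` and `post_epochs` as the lengths of the two training phases is an assumption based on the progress bars. The call itself is left commented out because it needs abaco and the objects defined in the earlier cells.

```python
# Optional keyword arguments taken from the abaco_run signature.
# The values are illustrative placeholders, not recommendations.
custom_params = {
    "pre_epochs": 500,   # default: 2000
    "post_epochs": 500,  # default: 2000
    "seed": 0,           # default: 42
}

# With abaco installed and the earlier cells executed, the call would look like:
# abaco_model = abaco_run(
#     dataloader=dataloader,
#     n_batches=len(ohe_batch_classes),
#     n_bios=len(ohe_bio_classes),
#     input_size=df_features.shape[1],
#     device=device,
#     **custom_params,
# )
print(custom_params)
```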


After training the ABaCo model, we can use it to reconstruct the AD dataset without the batch variance while keeping the biological variance.

-----

## ABaCo data reconstruction

To reconstruct the dataset we use `abaco.ABaCo.abaco_recon()`. Reconstruction can be run with or without the Monte Carlo setup; running it without (`monte_carlo=1`, as in the cell below) is recommended.

In [9]:
from abaco.ABaCo import abaco_recon

# Reconstruct the dataset using the trained ABaCo model
corrected_dataset = abaco_recon(
    model=abaco_model,
    device=device,
    data=df_ad,
    dataloader=dataloader,
    sample_label=id_col,
    batch_label=batch_col,
    bio_label=bio_col,
    seed=42,
    monte_carlo=1, # without Monte Carlo reconstruction
)

# Plot the PCoA of the reconstructed dataset
plotPCoA(
    data = corrected_dataset, 
    sample_label=id_col, 
    batch_label=batch_col, 
    experiment_label=bio_col
)

### Conclusion

The goal was to: 

&#x2705; correct the batch effect (reduce clustering of points by colour)

&#x2705; maintain biological variance (maintain separation of circles and squares)

A brief visual inspection of the PCoA of the reconstructed AD data suggests that ABaCo reduced the batch effect associated with processing the samples on different days, while still retaining the variance due to the experimental condition of lower vs higher phenol concentration.
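
Visual inspection can be complemented with a number. One option, sketched below, is a PERMANOVA-style R²: the share of total squared-distance variation that lies between the groups of a given labelling. This is not part of abaco, and `grouping_r2` is a hypothetical helper written for this illustration. Applied to a sample-by-sample distance matrix before and after correction, one would hope to see the R² for `batch` drop while the R² for `trt` is retained.

```python
import numpy as np


def grouping_r2(dist, groups):
    """Share of total squared-distance variation that lies between the groups."""
    sq = np.asarray(dist, dtype=float) ** 2
    groups = np.asarray(groups)
    n = len(groups)
    # Total sum of squares from all pairwise distances.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    return 1.0 - ss_within / ss_total


# Toy example: four points on a line, forming two tight pairs far apart.
x = np.array([0.0, 0.1, 10.0, 10.1])
dist = np.abs(x[:, None] - x[None, :])
r2_separated = grouping_r2(dist, ["a", "a", "b", "b"])  # labels follow the pairs
r2_mixed = grouping_r2(dist, ["a", "b", "a", "b"])      # labels cut across the pairs
print(r2_separated, r2_mixed)
```

A value near 1 means the labelling explains almost all of the variation, and a value near 0 means the groups are thoroughly mixed.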


---