# `ABaCo` tutorial: Anaerobic Digestion benchmark dataset

In this notebook we apply ABaCo to correct batch effects in an Anaerobic Digestion (AD) dataset whose samples were exposed to different phenol concentrations.

-----
**Data Description:**
- The AD dataset, generated by [Chapleur et al. (2016)](https://doi.org/10.1007/s10532-015-9751-4) and shared in the R package [PLSDAbatch](https://github.com/EvaYiwenWang/PLSDAbatch) by [Wang and Cao (2023)](https://doi.org/10.1093/bib/bbac622), comprises 75 samples and 567 identified taxonomic groups.
- The samples were treated with two different phenol concentrations, which is the biological source of variation.
- The samples were processed on 5 different dates over a period of 2 years, which is the technical source of variation (i.e., the batch effect).

**Goal:**

The aim of using ABaCo is to remove the technical variation (batch effect) while retaining the biological variation (phenol group effect).

-----

## Dataset Requirements for ABaCo

To be compatible with the ABaCo framework, the dataset should have the following layout:

| sample | batch      | trt  | feat1 | feat2 | ... |
|--------|------------|------|-------|-------|-----|
| A      | 24/07/2025 | low  | #     | #     | ... |
| B      | 15/06/2024 | low  | #     | #     | ... |
| C      | 24/07/2025 | high | #     | #     | ... |

- The data must have 3 categorical columns:
    1. Unique ids that identify the observations/samples, e.g., sample ids (`sample`).
    2. Labels for the batch/factor groupings to be corrected by ABaCo, e.g., dates of sample analysis (`batch`).
    3. Labels for the biological/experimental factor whose variation ABaCo should retain when correcting the batch effect, e.g., phenol concentration condition (`trt`).

- All remaining columns are the numeric features that the model is trained on.
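
As a rough illustration of this layout, the sketch below builds a toy table with pandas and checks it against the requirements listed above. The helper `check_abaco_layout` and the toy values are hypothetical and not part of `abaco`; they only show what "3 categorical columns plus numeric features" could look like in practice.

```python
import pandas as pd

# Toy table in the layout described above (values are made up for illustration)
toy = pd.DataFrame(
    {
        "sample": ["A", "B", "C"],
        "batch": ["24/07/2025", "15/06/2024", "24/07/2025"],
        "trt": ["low", "low", "high"],
        "feat1": [10, 0, 3],
        "feat2": [0, 7, 1],
    }
)

# Cast the three label columns to pandas' categorical dtype
toy = toy.astype({"sample": "category", "batch": "category", "trt": "category"})


def check_abaco_layout(df, id_col, batch_col, bio_col):
    """Hypothetical helper: return a list of layout problems (empty if none)."""
    problems = []
    label_cols = (id_col, batch_col, bio_col)
    for col in label_cols:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not isinstance(df[col].dtype, pd.CategoricalDtype):
            problems.append(f"column is not categorical: {col}")
    if id_col in df.columns and df[id_col].duplicated().any():
        problems.append(f"duplicated ids in: {id_col}")
    # Everything that is not a label column is treated as a feature
    for col in df.columns:
        if col not in label_cols and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"feature is not numeric: {col}")
    return problems


print(check_abaco_layout(toy, "sample", "batch", "trt"))  # []
```

An empty list means the toy table satisfies all of the checks in this sketch.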

The `abaco.BatchEffectDataLoader` module provides a `DataPreprocess()` function that converts a plain-text file (e.g., csv, tsv) into a compatible `pd.DataFrame`, and a `one_hot_encoding()` function that encodes categorical columns.

In [1]:
from abaco.BatchEffectDataLoader import DataPreprocess, one_hot_encoding

# Path to the AD count data and the names of its label columns
path_to_dataset = "data/dataset_ad.csv"
batch_col = "batch"
id_col = "sample"
bio_col = "trt"

# Convert data path into compatible pd.DataFrame with ABaCo framework
df_ad = DataPreprocess(
    path_to_dataset,
    factors=[
        batch_col, 
        id_col, 
        bio_col
    ]
)

# see that there are 3 categorical and n numeric columns
print(df_ad.info())

# Isolate the (numeric) features to be input into the model
df_features = df_ad.drop(columns=[batch_col, id_col, bio_col]).values
print(f"\n The feature inputs have shape: {df_features.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Columns: 570 entries, sample to Cluster_17710
dtypes: category(3), int64(567)
memory usage: 335.5 KB
None

 The feature inputs have shape: (75, 567)


In [2]:
# One-hot encode the categorical columns
# the batch factor
ohe_batches, ohe_batch_classes = one_hot_encoding(df_ad[batch_col])
# and the biological factor
ohe_bio, ohe_bio_classes = one_hot_encoding(df_ad[bio_col])

print(f"Batch classes: {ohe_batch_classes}\n")
print(f"Biological classes: {ohe_bio_classes}")

Batch classes: ['09/04/2015', '14/04/2016', '01/07/2016', '14/11/2016', '21/09/2017']

Biological classes: ['0-0.5', '1-2']
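
To make explicit what a one-hot encoding contains, here is a minimal sketch using `pandas.get_dummies` on a toy label column. It is an illustration only and does not show the internals of `one_hot_encoding`: the toy labels are made up, and the column order produced by pandas (sorted categories) may differ from the class order that ABaCo returns.

```python
import numpy as np
import pandas as pd

# Toy batch labels (made up); each sample belongs to exactly one batch
labels = pd.Series(["b1", "b2", "b1", "b3"], dtype="category", name="batch")

# One column per class, with a 1 marking the class of each sample
ohe = pd.get_dummies(labels, dtype=int)
classes = list(ohe.columns)

print(classes)
print(ohe.to_numpy())

# Every row contains a single 1, so argmax recovers the original label
recovered = [classes[i] for i in np.argmax(ohe.to_numpy(), axis=1)]
assert recovered == list(labels)
```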


### The AD dataset in brief
`abaco` also provides plotting functions to easily explore the data. Here we use `BatchEffectPlots.plotPCoA()` to visualize the batch and biological effects on the data.

In [3]:
from abaco.BatchEffectPlots import plotPCoA

# pcoa
plotPCoA(
    data=df_ad, 
    sample_label=id_col, 
    batch_label=batch_col, 
    experiment_label=bio_col,
)

>> clustergrammer2 backend version 0.18.0


- Batch effect (colours):
    - The samples were processed on 5 different dates over a period of 2 years, which is the technical source of variation.
    - The points cluster by colour, which indicates a batch effect.
- Biological effect (shapes):
    - The samples were treated with 2 different phenol concentrations, which is the biological source of variation.
    - Within each batch, the circles (lower phenol concentration) lie to the left of the squares (higher concentration).
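
For readers unfamiliar with PCoA, the sketch below shows the general technique on random toy counts: compute pairwise dissimilarities, double-centre the squared distance matrix, and use the leading eigenvectors as coordinates. Bray-Curtis is assumed here because it is a common choice for count data; which dissimilarity `plotPCoA` uses internally is not stated in this notebook. The functions `bray_curtis` and `pcoa` are hypothetical helpers written only for this illustration.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between the rows of a count matrix."""
    diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    total = (counts[:, None, :] + counts[None, :, :]).sum(axis=2)
    return diff / total


def pcoa(dist, n_axes=2):
    """Classical PCoA: eigendecomposition of the double-centred squared distances."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]  # largest eigenvalues first
    # Clip small negative eigenvalues, which can occur for non-Euclidean distances
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))


rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=5.0, size=(12, 30)).astype(float)  # 12 samples, 30 taxa
coords = pcoa(bray_curtis(toy_counts))
print(coords.shape)  # (12, 2)
```

Each row of `coords` is one sample; plotting the two columns against each other, coloured by batch and shaped by treatment, gives a figure of the kind described above.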


### The goal 

The aim of **ABaCo** is to: 
1) correct the batch effect (i.e., the points should no longer cluster by colour in the PCoA) while
2) maintaining biological variance (i.e., the circles and squares should stay separated).

Ideally, after using ABaCo to transform the data, the resulting PCoA shows points of all colours intermixed, with the circles still to the left of the squares.


---

# Using `ABaCo`

## Training the ABaCo model

To train ABaCo on the prepared AD dataset, we instantiate the `metaABaCo` class from `abaco.ABaCo` with the required parameters shown in the cell below. The optional parameters that are not passed keep their default values.

In practice the optional parameters usually need to be adjusted to the dataset; they are described briefly in the docstring, which can be viewed with `help(metaABaCo)`.

In [7]:
from abaco.ABaCo import metaABaCo
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create ABaCo model
model = metaABaCo(
    data=df_ad, # Pre-processed dataframe
    n_bios=2, # Number of biological groups in the data
    bio_label=bio_col, # Column where biological groups are labeled in the dataframe
    n_batches=5, # Number of batch groups in the data
    batch_label=batch_col, # Column where batch groups are labeled in the dataframe
    n_features=567, # Number of features (taxonomic groups)
    prior="MoG", # Prior distribution 
    device=device, # Device
)
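
The group and feature counts passed above are specific to the AD dataset. As a sketch, they could instead be derived from the DataFrame, so that the cell does not need editing for another dataset. The toy DataFrame below stands in for `df_ad`; only standard pandas calls are used, and the toy values are made up.

```python
import pandas as pd

# Toy stand-in for the pre-processed DataFrame (values are made up)
toy = pd.DataFrame(
    {
        "sample": ["A", "B", "C", "D"],
        "batch": ["d1", "d2", "d1", "d3"],
        "trt": ["low", "low", "high", "high"],
        "feat1": [10, 0, 3, 5],
        "feat2": [0, 7, 1, 2],
    }
)
id_col, batch_col, bio_col = "sample", "batch", "trt"

n_batches = toy[batch_col].nunique()  # number of distinct batch labels
n_bios = toy[bio_col].nunique()  # number of distinct biological groups
n_features = toy.drop(columns=[id_col, batch_col, bio_col]).shape[1]

print(n_batches, n_bios, n_features)  # 3 2 2
```

Applied to `df_ad`, the same expressions would be expected to give 5, 2 and 567, matching the class lists and feature shape printed earlier.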

In [8]:
help(metaABaCo)

Help on class metaABaCo in module abaco.ABaCo:

class metaABaCo(torch.nn.modules.module.Module)
 |  metaABaCo(data, n_bios, bio_label, n_batches, batch_label, n_features, device, prior='MoG', pdist='ZINB', d_z=16, epochs=[1000, 2000, 2000], encoder_net=[512, 256, 128], decoder_net=[128, 256, 512], vae_act_fun=ReLU(), disc_net=[128, 64], disc_act_fun=ReLU())
 |
 |  Method resolution order:
 |      metaABaCo
 |      torch.nn.modules.module.Module
 |      builtins.object
 |
 |  Methods defined here:
 |
 |  __init__(self, data, n_bios, bio_label, n_batches, batch_label, n_features, device, prior='MoG', pdist='ZINB', d_z=16, epochs=[1000, 2000, 2000], encoder_net=[512, 256, 128], decoder_net=[128, 256, 512], vae_act_fun=ReLU(), disc_net=[128, 64], disc_act_fun=ReLU())
 |      Function to create the metaABaCo model.
 |
 |      Parameters
 |      ----------
 |      data: pd.DataFrame
 |          Pre-processed DataFrame to correct. Only feature columns to correct should be of numerical data ty

To train the model, we call the `correct()` method of the `metaABaCo` instance, passing only a `seed` and leaving its other parameters at their defaults.

In [9]:
model.correct(seed=42)

Training: VAE for learning meaningful embeddings: 100%|██████████| 1000/1000 [00:14<00:00, 68.62it/s, bio_penalty=0.0149, clustering_loss=0.0262, elbo=468.4863, epoch=999/1001, vae_loss=468.5274]
Training: Embeddings batch effect correction using adversrial training: 100%|██████████| 2000/2000 [00:36<00:00, 54.78it/s, adv_loss=-1.5702, bio_penalty=0.0105, clustering_loss=0.0262, disc_loss=1.5702, elbo=431.7645, epoch=1999/2001, vae_loss=431.8013]
Training: VAE decoder with masked batch labels: 100%|██████████| 2000/2000 [00:14<00:00, 141.42it/s, cycle_loss=0.0000, epoch=2000/2000, vae_loss=1294.9338]


After training, the ABaCo model can be used to reconstruct the AD dataset without the batch variance while keeping the biological variance.

-----

## ABaCo data reconstruction

To reconstruct the dataset, we call the `reconstruct()` method of the trained `metaABaCo` instance.

In [12]:
# Reconstruct the dataset using the trained ABaCo model
corrected_dataset = model.reconstruct(seed=42)

# Plot the PCoA of the reconstructed dataset
plotPCoA(
    data=corrected_dataset,
    sample_label=id_col, 
    batch_label=batch_col, 
    experiment_label=bio_col
)

### Conclusion

The goal was to: 

&#x2705; correct the batch effect (reduce clustering of points by colour)

&#x2705; maintain biological variance (maintain separation of circles and squares)

A brief visual inspection of the PCoA of the reconstructed AD data suggests that ABaCo reduced the batch effect associated with processing the samples on different days, while still retaining the variance due to the experimental condition of lower vs higher phenol concentration.
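
Visual inspection can be complemented with a number. One common choice is a PERMANOVA-style R², the share of the total distance-based variation that is explained by a grouping. The sketch below is not part of `abaco`: `group_r2` and `euclidean` are hypothetical helpers, and Euclidean distances on synthetic data are used purely for illustration. On real data, a correction that works as intended would be expected to lower R² for the batch labels while leaving R² for the treatment labels largely intact.

```python
import numpy as np


def euclidean(x):
    """Pairwise Euclidean distances between the rows of x."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)


def group_r2(dist, labels):
    """Fraction of the total sum of squared distances explained by `labels`."""
    labels = np.asarray(labels)
    n = len(labels)
    sq = dist ** 2
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in np.unique(labels):
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    return 1.0 - ss_within / ss_total


rng = np.random.default_rng(0)
batch = np.repeat([0, 1, 2], 10)  # 3 synthetic batches of 10 samples each
noise = rng.normal(size=(30, 5))  # data without any batch shift
shifted = noise + 3.0 * batch[:, None]  # same data with a strong batch shift

r2_shifted = group_r2(euclidean(shifted), batch)
r2_noise = group_r2(euclidean(noise), batch)
print(f"R2 with batch shift:    {r2_shifted:.2f}")
print(f"R2 without batch shift: {r2_noise:.2f}")
```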


---