# AutoPeptideML Python API

## 1. Introduction

The AutoPeptideML Python API is centred on a single class, `AutoPeptideML`. Its constructor accepts three arguments:

- `verbose`: boolean value. Default: `True`.
- `threads`: number of threads to use for multithreading. By default it uses all available CPU cores.
- `seed`: seed for the pseudo-random number generator used by all stochastic processes. Default: `42`.

In [None]:
from autopeptideml.autopeptideml import AutoPeptideML

apml = AutoPeptideML(
    verbose=True,
    threads=8,
    seed=42
)

## 2. Dataset preparation

There are 3 methods to handle dataset preparation:

- `autosearch_negatives`: Searches for bioactive peptides to use as negative samples.
    - `df_pos`: `pd.DataFrame` with positive samples
    - `positive_tags`: `List[str]` with all bioactivities that may overlap with the positive class
    - `proportion`: `float`. Target negative:positive ratio. Default: `1.0`.
- `balance_samples`: Balances labels in the dataset by oversampling the underrepresented classes.
    - `df`: `pd.DataFrame`. Dataframe with `Y` column, for which labels will be balanced.
- `curate_dataset`: Loads the dataset and removes non-canonical and empty sequences.
    - `dataset`: `Union[str, pd.DataFrame]`. The input can be either the path to a `.fasta`, `.csv`, or `.tsv` file or a `pd.DataFrame`.
    - `outputdir`: `str`. Path to a directory where to save the curated dataset.


In [None]:
# Dataset curation
df_negs = apml.curate_dataset(
    dataset='example_dataset_with_negatives.fasta',
    outputdir='output_dir'
)
df_pos = apml.curate_dataset(
    dataset='example_dataset_with_positives.fasta',
    outputdir='output_dir_2'
)

# Balance the labels (only if df contains negative samples)
df_negs_balanced = apml.balance_samples(df_negs)

# Autosearch for negatives
df = apml.autosearch_negatives(
    df_pos=df_pos,
    positive_tags=['Neuropeptides'],
    proportion=1.0
)
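
The curation and balancing ideas described above can be illustrated with a minimal pure-Python sketch. This is not AutoPeptideML's implementation: the helper names `is_canonical`, `curate` and `oversample` are hypothetical, and the sketch assumes that "canonical" means the 20 standard amino acids and that balancing resamples minority classes with replacement.

```python
import random
from collections import Counter

# Assumption: "canonical" residues are the 20 standard amino acids.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


def is_canonical(seq):
    """Return True if seq is non-empty and contains only canonical residues."""
    return len(seq) > 0 and set(seq.upper()) <= CANONICAL


def curate(sequences):
    """Keep only non-empty, canonical sequences (hypothetical helper)."""
    return [seq for seq in sequences if is_canonical(seq)]


def oversample(samples, seed=42):
    """Resample smaller classes with replacement until all classes have
    as many samples as the largest one. `samples` is a list of
    (sequence, label) pairs (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = {}
    for seq, label in samples:
        by_label.setdefault(label, []).append((seq, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing samples at random from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


raw = ["ACDK", "", "AXBZ", "GGHW", "KKLM"]
clean = curate(raw)  # drops the empty string and "AXBZ"

samples = [("ACDK", 1), ("GGHW", 0), ("KKLM", 0)]
balanced = oversample(samples)
counts = Counter(label for _, label in balanced)
print(clean)
print(counts)
```

After oversampling, both labels appear the same number of times, because the only positive sample is drawn again to match the two negatives.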

## 3. Dataset partitioning

Dataset partitioning happens in two steps: a training/evaluation split, followed by training/validation folds.

- `train_test_partition`: Creates training/evaluation sets using a novel homology partitioning algorithm.
    - `df`: `pd.DataFrame`
    - `threshold`: `float`. Maximum sequence identity value between sequences in training and evaluation sets. Default: `0.3`
    - `test_size`: `float`. Proportion of samples that should comprise the evaluation set. Default: `0.2`
    - `alignment`: `str`. Alignment method to be used. Options: `needle`, `mmseqs` and `mmseqs+prefilter`. Default: `mmseqs+prefilter`
    - `outputdir`: `str`. Path to a directory where to save the generated datasets.
- `train_val_partition`: Creates n training/validation folds
    - `df`: `pd.DataFrame`. Should be the training dataset generated with the previous step.
    - `method`: `str`. Method for partitioning. Options: `random` and `graph-part`. `random` refers to `StratifiedKFold` from `sklearn.model_selection` and `graph-part` to `stratified_k_fold` from the GraphPart algorithm. For more details see the [Project GitHub Repository](https://github.com/graph-part/graph-part).
    - `threshold`: `float`. Maximum sequence identity value between sequences in training and validation folds. Only valid if method is `graph-part`. Default: `0.5`.
    - `alignment`: `str`. Alignment method to be used. Options: `needle`, `mmseqs` and `mmseqs+prefilter`. Only valid if method is `graph-part`. Default: `mmseqs+prefilter`.
    - `n_folds`: `int`. Number of folds to be generated. Default: `10`.
    - `outputdir`: `str`. Path to a directory where to save the generated datasets.
    

In [None]:
datasets = apml.train_test_partition(
    df=df,
    threshold=0.3,
    test_size=0.2,
    alignment='mmseqs+prefilter',
    outputdir='outputdir/splits'
)
folds = apml.train_val_partition(
    df=datasets['train'],
    method='random',
    n_folds=10,
    outputdir='outputdir/folds'
)
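
To show what stratification means for the `random` method, here is a minimal pure-Python sketch that assigns each sample to a fold while keeping the label proportions equal across folds. It is an illustration of the idea only, not the code of `StratifiedKFold` or of AutoPeptideML; the function name `stratified_folds` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=42):
    """Return one fold index per sample so that every fold receives
    a similar share of each label (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    assignment = [None] * len(labels)
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled samples of each class round-robin across folds.
        for position, idx in enumerate(indices):
            assignment[idx] = position % n_folds
    return assignment


labels = [0] * 50 + [1] * 50
assignment = stratified_folds(labels, n_folds=10)

for fold in range(10):
    # Samples assigned to `fold` form its validation set; the rest train.
    val_idx = [i for i, f in enumerate(assignment) if f == fold]
    positives = sum(labels[i] for i in val_idx)
    print(fold, len(val_idx), positives)
```

With 50 samples per class and 10 folds, each validation fold holds 10 samples, 5 of them positive. The `graph-part` method additionally constrains sequence identity between folds, which this sketch does not attempt.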

## 4. Peptide Representation

The Peptide Representation step requires an additional class within the AutoPeptideML package, `RepresentationEngine`, that loads the Protein Language Model (PLM) of choice.

- `RepresentationEngine`:
    - `model`: `str`. Protein Language Model; see the `README.md` file in the GitHub repository. Default: `esm2-8m`
    - `batch_size`: `int`. Number of peptide sequences to compute in each batch; the appropriate value depends on the memory available on the CPU or GPU. Default: `64`.
- `AutoPeptideML`:
    - `compute_representations`: Uses the `RepresentationEngine` class to compute the representations in the dataset.
        - `datasets`: `Dict[str, pd.DataFrame]` dictionary with the dataset partitions
        - `re`: `RepresentationEngine`



In [None]:
from autopeptideml.utils.embeddings import RepresentationEngine

re = RepresentationEngine(
    model='esm2-8m',
    batch_size=64
)
id2rep = apml.compute_representations(
    datasets=datasets,
    re=re
)
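
The role of `batch_size` can be sketched in plain Python: sequences are processed in chunks so that only one chunk has to fit in memory at a time. Everything here is illustrative. `batched` and `toy_embed` are hypothetical helpers, `toy_embed` is a stand-in for a real PLM forward pass, and keying the resulting dictionary by sequence is an assumption rather than the documented structure of `id2rep`.

```python
def batched(sequences, batch_size=64):
    """Yield consecutive chunks of at most batch_size sequences."""
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


def toy_embed(batch):
    """Stand-in for a PLM: one small fixed-length vector per sequence
    (sequence length and number of lysines)."""
    return [[len(seq), seq.count("K")] for seq in batch]


sequences = ["ACDK", "GGHW", "KKLM", "PQRS", "TVWY"]

# Assumption: the mapping is keyed by sequence for this illustration.
toy_id2rep = {}
n_batches = 0
for batch in batched(sequences, batch_size=2):
    n_batches += 1
    for seq, vector in zip(batch, toy_embed(batch)):
        toy_id2rep[seq] = vector

print(n_batches, len(toy_id2rep))
```

A smaller `batch_size` lowers peak memory use at the cost of more forward passes; a larger one does the opposite.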

## 5. Hyperparameter Optimisation and Model Training

- `hpo_train`
    - `config`: `dict`. Hyperparameter search space, loaded from a `JSON` file. For examples of the format, please refer to the files in `autopeptideml/data/configs`.
    - `train_df`: `pd.DataFrame` with the training dataset.
    - `id2rep`: `dict`. Representations returned by `apml.compute_representations`.
    - `folds`: `list`. List of training/validation folds.
    - `outputdir`: `str`. Path to a directory where to save the results.

In [None]:
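import json

# Load the hyperparameter search space from a JSON config file
with open('../autopeptideml/data/configs/default_config.json') as f:
    config = json.load(f)

model = apml.hpo_train(
    config=config,
    train_df=datasets['train'],
    id2rep=id2rep,
    folds=folds,
    outputdir='outputdir/ensemble'
)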

## 6. Ensemble Evaluation

- `evaluate_model`
    - `best_model`: Ensemble generated in the previous step.
    - `test_df`: `pd.DataFrame` with the evaluation set.
    - `id2rep`: `dict`. Representations generated in Step 4.
    - `outputdir`: `str`. Path to a directory where to save the results.


In [None]:
results = apml.evaluate_model(
    best_model=model,
    test_df=datasets['test'],
    id2rep=id2rep,
    outputdir='outputdir/results'
)

## 7. Prediction

- `predict`: Predict the bioactivity of a set of peptide sequences given an ensemble already trained.
    - `df`: `pd.DataFrame` with the peptide sequences.
    - `re`: `RepresentationEngine`
    - `ensemble_path`: `str`. Path to the directory where the ensemble files were saved.
    - `outputdir`: `str`. Path to a directory where to save the predictions.

In [None]:
import pandas as pd

apml.predict(
    df=pd.read_csv('New_samples.csv'),
    re=re,
    ensemble_path='outputdir/ensemble',
    outputdir='prediction'
)
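
How an ensemble turns several models into one prediction is not specified above. One common approach, shown here purely as an assumption and not as AutoPeptideML's documented behaviour, is to average the positive-class probability of each member. The names `ensemble_predict` and the toy models are hypothetical.

```python
def ensemble_predict(models, features):
    """Average the positive-class probability returned by each model.
    Assumption: every model is a callable returning a value in [0, 1]."""
    scores = [model(features) for model in models]
    return sum(scores) / len(scores)


# Toy ensemble members that ignore their input and return a fixed probability.
toy_models = [
    lambda features: 0.9,
    lambda features: 0.6,
    lambda features: 0.3,
]

probability = ensemble_predict(toy_models, features=[0.1, 0.2])
predicted_label = int(probability >= 0.5)  # simple 0.5 decision threshold
print(round(probability, 3), predicted_label)
```

Averaging smooths out the disagreement between individual members, so a single over-confident model has less influence on the final call.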