Hestia

Computational tool for generating generalisation-evaluating evaluation sets.

Documentation: https://ibm.github.io/Hestia-OOD
Source Code: https://github.com/IBM/Hestia-OOD
Webserver: http://peptide.ucd.ie/Hestia
Paper Pre-print: https://www.biorxiv.org/content/10.1101/2024.03.14.584508v1

Installation

Installing in a conda environment is recommended. For creating the environment, please run:

conda create -n autopeptideml python
conda activate autopeptideml

1. Python Package

1.1.From PyPI

pip install hestia-ood

1.2. Directly from source

pip install git+https://github.com/IBM/Hestia-OOD

3. Third-party dependencies

For using MMSeqs as alignment algorithm is necessary install it in the environment:

conda install -c bioconda mmseqs2

For using Needleman-Wunch:

conda install -c bioconda emboss

If installation not in conda environment, please check installation instructions for your particular device:

Linux:

wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH

sudo apt install emboss

sudo apt install emboss

Windows: Download binaries from EMBOSS and MMSeqs2-latest

Mac:

sudo port install emboss
brew install mmseqs2

Documentation

1. DatasetGenerator

The HestiaDatasetGenerator allows for the easy generation of training/validation/evaluation partitions with different similarity thresholds. Enabling the estimation of model generalisation capabilities. It also allows for the calculation of the ABOID (Area between the similarity-performance curve (Out-of-distribution) and the In-distribution performance).

from hestia.dataset_generator import HestiaDatasetGenerator, SimilarityArguments

# Initialise the generator for a DataFrame
generator = HestiaDatasetGenerator(df)

# Define the similarity arguments (for more info see the documentation page https://ibm.github.io/Hestia-OOD/datasetgenerator)
args = SimilarityArguments(
    data_type='protein', field_name='sequence',
    similarity_metric='mmseqs2+prefilter', verbose=3,
    save_alignment=True
)

# Calculate the similarity
generator.calculate_similarity(args)

# Load pre-calculated similarities
generator.load_similarity(args.filename + '.csv.gz')

# Calculate partitions
generator.calculate_partitions(min_threshold=0.3,
                               threshold_step=0.05,
                               test_size=0.2, valid_size=0.1)

# Save partitions
generator.save_precalculated('precalculated_partitions.gz')

# Load pre-calculated partitions
generator.from_precalculated('precalculated_partitions.gz')

# Training code
# ...

# Calculate ABOID

generator.calculate_aboid(results, 'test_mcc')

# Plot ABOID
generator.plot_aboid(results, 'test_mcc')

2. Similarity calculation

Calculating pairwise similarity between the entities within a DataFrame df_query or between two DataFrames df_query and df_target can be achieved through the calculate_similarity function:

from hestia.similarity import calculate_similarity
import pandas as pd

df_query = pd.read_csv('example.csv')

# The CSV file needs to have a column describing the entities, i.e., their sequence, their SMILES, or a path to their PDB structure.
# This column corresponds to `field_name` in the function.

sim_df = calculate_similarity(df_query, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')

More details about similarity calculation can be found in the Similarity calculation documentation.

3. Clustering

Clustering the entities within a DataFrame df can be achieved through the generate_clusters function:

from hestia.similarity import calculate_similarity
from hestia.clustering import generate_clusters
import pandas as pd

df = pd.read_csv('example.csv')
sim_df = calculate_similarity(df, species='protein', similarity_metric='mmseqs+prefilter',
                              field_name='sequence')
clusters_df = generate_clusters(df, field_name='sequence', sim_df=sim_df,
                                cluster_algorithms='CDHIT')

There are three clustering algorithms currently supported: CDHIT, greedy_cover_set, or connected_components. More details about clustering can be found in the Clustering documentation.

4. Partitioning

Partitioning the entities within a DataFrame df into a training and an evaluation subsets can be achieved through 4 different functions: ccpart, graph_part, reduction_partition, and random_partition. An example of how cc_part would be used is:

from hestia.partition import ccpart
import pandas as pd

df = pd.read_csv('example.csv')
train, test = cc_part(df, species='protein', similarity_metric='mmseqs+prefilter',
                      field_name='sequence', threshold=0.3, test_size=0.2)

train_df = df.iloc[train, :]
test_df = df.iloc[test, :]

License

Hestia is an open-source software licensed under the MIT Clause License. Check the details in the LICENSE file.

Name		Name	Last commit message	Last commit date
Latest commit History 44 Commits
.github		.github
docs		docs
hestia		hestia
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github

.github

docs

docs

hestia

hestia

.gitignore

.gitignore

CODE_OF_CONDUCT.md

CODE_OF_CONDUCT.md

LICENSE

LICENSE

README.md

README.md

mkdocs.yml

mkdocs.yml

setup.py

setup.py

Repository files navigation

Hestia

Contents

Installation

1. Python Package

1.1.From PyPI

1.2. Directly from source

3. Third-party dependencies

Documentation

1. DatasetGenerator

2. Similarity calculation

3. Clustering

4. Partitioning

License

About

Releases 7

Packages

Contributors 2

Languages

License

IBM/Hestia-OOD

Folders and files

Latest commit

History

Repository files navigation

Hestia

Contents

Installation

1. Python Package

1.1.From PyPI

1.2. Directly from source

3. Third-party dependencies

Documentation

1. DatasetGenerator

2. Similarity calculation

3. Clustering

4. Partitioning

License

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Languages