Open-source collection of biology datasets and pre-trained embeddings. 🧬 📕
bio-datasets is a collaborative framework for fetching publicly available, sequence-based protein datasets. Pre-trained contextual embeddings are also available for these datasets.
Install the package and its dependencies with pip:
pip install bio-datasets
from biodatasets import list_datasets, load_dataset
# List the names of the available datasets
print(list_datasets())
# Load your dataset
pathogen = load_dataset("pathogen")
# Display the available columns and embeddings
print(pathogen)
# Get data from your dataset
X, y = pathogen.to_npy_arrays(input_names=["sequence"], target_names=["class"])
embeddings = pathogen.get_embeddings("sequence", "protbert", "cls")
# Get a full description of your dataset
pathogen.display_description()
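Once the embeddings and targets are loaded, they can be passed to any NumPy-based model. The sketch below is illustrative only and is not part of bio-datasets: it assumes the embeddings come back as a 2-D array of shape (n_samples, dim) and the targets as a 1-D array of class labels, and it uses synthetic stand-ins for both so that it runs without downloading a dataset. The helper names nearest_centroid_fit and nearest_centroid_predict are hypothetical.

```python
import numpy as np


def nearest_centroid_fit(features, labels):
    """Return the sorted class labels and one mean vector per class."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids


def nearest_centroid_predict(features, classes, centroids):
    """Assign each row to the class whose centroid is closest (Euclidean)."""
    # Broadcast to a (n_samples, n_classes) matrix of distances.
    distances = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    return classes[distances.argmin(axis=1)]


# Synthetic stand-ins (assumption): replace with the arrays obtained from
# pathogen.get_embeddings(...) and pathogen.to_npy_arrays(...).
rng = np.random.default_rng(0)
n_per_class, dim = 50, 16
embeddings = np.concatenate([
    rng.normal(loc=-2.0, size=(n_per_class, dim)),
    rng.normal(loc=2.0, size=(n_per_class, dim)),
])
y = np.repeat([0, 1], n_per_class)

# Shuffle, then hold out 20 samples for evaluation.
order = rng.permutation(len(y))
train, test = order[:80], order[80:]

classes, centroids = nearest_centroid_fit(embeddings[train], y[train])
predictions = nearest_centroid_predict(embeddings[test], classes, centroids)
accuracy = float((predictions == y[test]).mean())
print(f"held-out accuracy: {accuracy:.2f}")
```

A nearest-centroid classifier is used here only because it needs nothing beyond NumPy. With real embeddings, check the array shapes first, since the layout assumed above is not confirmed by this README.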
See CONTRIBUTING.md for how to set up the project or add a public dataset.