CASP Secondary Structure Property Datasets: ETL pipelines for producing Critical Assessment of Protein Structure Prediction (CASP) secondary structure property datasets for AI models or analysis

Eryk96/CASP-Datasets

CASP Secondary Structure Property Datasets

ETL pipelines for producing Critical Assessment of Protein Structure Prediction (CASP) secondary structure feature datasets for AI models and analysis

Getting started

You can either download the datasets or run the project yourself. The current datasets include only PDB entries from the FM (free modelling) category. To build datasets that include additional classifications, edit the class_filter setting in the config files.
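As a rough illustration of what a class filter does, the sketch below keeps only the domain records whose classification appears in an allowed list. The record layout, the target identifiers, and the function name filter_by_class are hypothetical and are not taken from this project; the actual config schema and pipeline code may differ.

```python
from typing import Dict, Iterable, List

# Hypothetical domain records for illustration only; the identifiers are
# placeholders and the real pipeline's record layout may differ.
DOMAINS: List[Dict[str, str]] = [
    {"target": "T0001-D1", "classification": "FM"},
    {"target": "T0002-D1", "classification": "TBM"},
    {"target": "T0003-D1", "classification": "FM/TBM"},
]


def filter_by_class(
    domains: Iterable[Dict[str, str]], class_filter: Iterable[str]
) -> List[Dict[str, str]]:
    """Keep only the domains whose classification is in class_filter."""
    allowed = set(class_filter)
    return [d for d in domains if d["classification"] in allowed]


# Only free-modelling domains, which mirrors the published datasets.
fm_only = filter_by_class(DOMAINS, ["FM"])

# A broader filter that also admits template-based modelling domains.
fm_and_tbm = filter_by_class(DOMAINS, ["FM", "TBM"])
```

The match here is exact, so a combined label such as "FM/TBM" is kept only if it is listed explicitly in the filter.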

Requirements

  • Python 3.8+

Installation

Clone the repository and create a virtual environment inside it.

$ git clone <repository>
$ python -m venv venv

Activate the virtual environment

$ source venv/bin/activate

Install the dependencies and the casp package.

$ python setup.py install
$ pip install -r requirements.txt

Run the desired configuration.

$ casp run -c config/casp14.yml

CASP Datasets

The datasets are in the data folder, organized by competition number. Each competition folder contains a single dataset in which the FM domain CASP proteins have been merged. Each folder also contains the DSSP files, a domain summary, and the FASTA files for the entries.

  • CASP14
  • CASP13
  • CASP12
  • CASP11
  • CASP10
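To get a quick overview of what each competition folder contains, a small script along these lines can help. It is a minimal sketch that assumes the data folder sits in the current working directory and holds one subfolder per competition; the function name list_competition_files is hypothetical and is not part of the casp package.

```python
from pathlib import Path
from typing import Dict, List


def list_competition_files(data_dir: str = "data") -> Dict[str, List[str]]:
    """Map each competition subfolder name to its sorted file names.

    Assumes a layout of data_dir/<competition>/<files>; returns an empty
    dict when data_dir does not exist.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return {}
    return {
        folder.name: sorted(p.name for p in folder.iterdir() if p.is_file())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


# Print a one-line summary per competition folder (prints nothing if the
# data folder is not found).
for competition, files in list_competition_files().items():
    print(f"{competition}: {len(files)} files")
```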
