``cazy_webscraper`` is a Python3 package for the automated retrieval of Carbohydrate-Active enZyme (CAZyme) data from the CAZy database. This program is free to use under the MIT license, and we kindly request that, if you use this program or Python package, you cite it as indicated below.
Hobbs, Emma E. M.; Pritchard, Leighton; Chapman, Sean; Gloster, Tracey M. (2021): cazy_webscraper Microbiology Society Annual Conference 2021 poster. figshare. Poster. https://doi.org/10.6084/m9.figshare.14370860.v7
``cazy_webscraper`` retrieves data from CAZy, writing it to a local SQLite3 file (typically taking 10-15 minutes to scrape the entirety of CAZy).
Additionally, ``cazy_webscraper`` can:
- Retrieve protein data from UniProt for CAZymes in the local database. This data includes:
  - UniProt accession
  - Protein name
  - Protein amino acid sequence
  - EC numbers
  - PDB accessions
- Retrieve protein sequences from NCBI GenBank for CAZymes in the local database.
- Write out protein sequences retrieved from UniProt and NCBI in FASTA format, and build a local BLAST database.
- Retrieve protein structures from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank, PDB, for CAZymes in the local database.
- Be configured to scrape the entire CAZy database, or recover only CAZymes filtered by user-supplied criteria, such as CAZy classes, CAZy (sub)family, or taxonomy.
- Retrieve the latest taxonomic classifications (including the complete lineage) from the NCBI Taxonomy database.
We have produced a "Getting Started With cazy_webscraper" poster.
To download the entire CAZy dataset and save it to the current working directory with the file name ``cazy_webscraper_<date>_<time>.db``, use the following command structure:
cazy_webscraper <user_email>
Note
The user email address is a requirement of NCBI. NCBI is queried to identify the correct source organism for a given protein when multiple source organisms are retrieved from CAZy for a single protein. For more information, please see the NCBI Entrez documentation.
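The command above can also be driven from a Python script, for example as part of a larger pipeline. The following is a minimal sketch, assuming ``cazy_webscraper`` is installed and on the ``PATH``. The helper names ``build_command``, ``newest_database`` and ``scrape_cazy`` are hypothetical and are not part of the package, and the email check is deliberately crude.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_command(user_email: str) -> List[str]:
    """Return the argument list for `cazy_webscraper <user_email>`."""
    # Crude sanity check only (an assumption of this sketch, not an NCBI rule).
    if "@" not in user_email:
        raise ValueError(f"not a plausible email address: {user_email!r}")
    return ["cazy_webscraper", user_email]


def newest_database(directory: Path) -> Optional[Path]:
    """Return the most recently modified cazy_webscraper_*.db file, if any."""
    candidates = sorted(
        directory.glob("cazy_webscraper_*.db"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def scrape_cazy(user_email: str, workdir: Path) -> Optional[Path]:
    """Run the scraper inside workdir and return the database it wrote."""
    # check=True raises CalledProcessError if the scraper exits with an error.
    subprocess.run(build_command(user_email), cwd=workdir, check=True)
    return newest_database(workdir)
```

Calling ``scrape_cazy("you@example.org", Path.cwd())`` would run the full scrape and return the path of the newest ``cazy_webscraper_<date>_<time>.db`` file in the working directory.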
Below is the list of commands (excluding required and optional arguments) included in ``cazy_webscraper``.
CAZy
To retrieve data from CAZy and compile an SQLite database, use the ``cazy_webscraper`` command.
UniProt
To retrieve protein data from UniProt, use the ``cw_get_uniprot_data`` command.
The following data can be retrieved:
- UniProt accession
- Protein name
- EC numbers
- PDB accession
- Protein sequences
GenBank
- To retrieve protein sequences from GenBank, use the ``cw_get_genbank_seqs`` command.
- To retrieve the latest taxonomic classifications from NCBI Taxonomy, use the ``cw_get_ncbi_taxs`` command.
Extract sequences
To extract GenBank and/or UniProt protein sequences from a local CAZyme database, use the ``cw_extract_db_seqs`` command.
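For readers unfamiliar with the FASTA format that extracted sequences are written in, the sketch below shows its general layout: a header line starting with ``>`` followed by the sequence, wrapped to a fixed width. This is an illustration of the format only, not the code ``cw_extract_db_seqs`` uses. The function name ``format_fasta``, the 60-character line width and the accession in the usage note are all hypothetical.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render (accession, sequence) pairs as FASTA text."""
    lines = []
    for accession, sequence in records:
        # Header line: '>' followed by the record identifier.
        lines.append(f">{accession}")
        # Sequence lines: fixed-width chunks of the residue string.
        lines.extend(
            sequence[start:start + width]
            for start in range(0, len(sequence), width)
        )
    return "".join(line + "\n" for line in lines)
```

For example, ``format_fasta([("EXAMPLE_1", "MKTAYIAK")], width=4)`` yields a header line ``>EXAMPLE_1`` followed by the two sequence lines ``MKTA`` and ``YIAK``.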
PDB
To retrieve protein structure files from PDB, use the ``cw_get_pdb_structures`` command.
NCBI taxonomies
Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) from the NCBI Taxonomy database using the ``cw_get_ncbi_taxs`` command.
GTDB taxonomies
Retrieve the latest taxonomic classifications (including the complete lineage from kingdom to strain) from the GTDB database using the ``cw_get_gtdb_taxs`` command.
Interrogate the database
To interrogate the database, use the ``cw_query_database`` command.
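Because the local database is an ordinary SQLite3 file, it can also be opened with Python's standard ``sqlite3`` module. The sketch below makes no assumptions about the ``cazy_webscraper`` schema: it asks SQLite which tables exist and counts the rows in each. The function name ``summarise_database`` is hypothetical.

```python
import sqlite3
from typing import Dict


def summarise_database(db_path: str) -> Dict[str, int]:
    """Map each user table in an SQLite file to its row count."""
    connection = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every table; skip SQLite's internal tables.
        tables = [
            row[0]
            for row in connection.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Names come from sqlite_master itself and are quoted as identifiers.
        return {
            table: connection.execute(
                f'SELECT COUNT(*) FROM "{table}"'
            ).fetchone()[0]
            for table in tables
        }
    finally:
        connection.close()
```

Passing the path of a ``cazy_webscraper_<date>_<time>.db`` file returns a dictionary of table names and row counts, which is a quick check that a scrape populated the database. Note that ``sqlite3.connect`` creates an empty file if the path does not exist, so check the path first.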
When making many automated, repeated calls to a server, it is polite to do so when traffic at the server is lowest. This is typically at weekends and overnight.
When using ``cazy_webscraper`` to retrieve data from UniProt, NCBI or PDB, the webscraper can appear to run slowly, but this may be due to bandwidth at the database server or to server speed. ``cazy_webscraper`` provides a progress bar to reassure the user that the webscraper is working.
Warning
Please do not perform a retrieval of UniProt, NCBI and/or PDB data for the entire CAZy dataset, unless absolutely unavoidable. Retrieving the data from any of these external databases for the entire CAZy dataset will take several hours and may unintentionally deny the service to others.
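The advice above on polite retrieval can be sketched in a script that wraps long-running jobs. The example below is an illustration under stated assumptions, not part of ``cazy_webscraper``: the names ``is_off_peak`` and ``throttled`` are hypothetical, the 22:00 to 06:00 overnight window and the one-second pause are arbitrary choices, and the time passed in should be the local time at the server rather than at the user's machine.

```python
import time
from datetime import datetime
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def is_off_peak(moment: datetime, night_start: int = 22, night_end: int = 6) -> bool:
    """True at weekends, or between night_start and night_end on weekdays."""
    if moment.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return moment.hour >= night_start or moment.hour < night_end


def throttled(
    items: Iterable[T],
    call: Callable[[T], R],
    pause_seconds: float = 1.0,
) -> Iterator[R]:
    """Apply call to each item, sleeping between consecutive calls."""
    for index, item in enumerate(items):
        if index:  # no pause before the first call
            time.sleep(pause_seconds)
        yield call(item)
```

A script could check ``is_off_peak(datetime.now())`` before starting a large retrieval, and use ``throttled`` to space out any requests of its own.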
For details and updates on development, please consult the GitHub repository.
If you use ``cazy_webscraper`` in your work, please cite our work (including the provided DOI), as well as the specific version of the tool you use. This is not only helpful for us as developers to get our work out into the world, but it is also essential for the reproducibility and integrity of scientific research.
Citation:
Hobbs, E. E. M., Gloster, T. M., and Pritchard, L. (2022) 'cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets', bioRxiv, https://doi.org/10.1101/2022.12.02.518825
This paper includes a full description of the operation and examples of use.
``cazy_webscraper`` depends on a number of tools. To recognise the contributions that the authors and developers have made, please also cite the following:
- When making an SQLite database:
Hipp, R. D. (2020) SQLite, available: https://www.sqlite.org/index.html.
- Retrieving taxonomic, genomic or sequence data from NCBI:
Cock, P.J.A., Antao, T., Chang, J.T., Chapman, B.A., Cox, C.J., Dalke, A., et al (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics, 25(11), 1422-1423.
Wheeler, D.L., Benson, D.A., Bryant, S., Canese, K., Church, D.M., Edgar, R., Federhen, S., Helmberg, W., Kenton, D., Khovayko, O., et al (2005) Database resources of the National Center for Biotechnology Information: Update, Nucleic Acids Research, 33, D39-D45.
- Retrieving data from UniProt:
Cokelaer, T., Pultz, D., Harder, L. M., Serra-Musach, J., Saez-Rodriguez, J. (2013) BioServices: a common Python package to access biological Web Services programmatically, Bioinformatics, 29(24), 3241-3242.
- Downloading protein structure files from RCSB PDB:
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., et al (2000) The Protein Data Bank, Nucleic Acids Research, 28(1), 235-242.
Hamelryck, T., Manderick, B. (2003) PDB file parser and structure class implemented in Python, Bioinformatics, 19(17), 2308-2310.
- Retrieving and using taxonomic data from GTDB:
Parks, D.H., Chuvochina, M., Rinke, C., Mussig, A.J., Chaumeil, P., Hugenholtz, P. (2022) GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy, Nucleic Acids Research, 50(D1), D785-D794.
If there are additional features you would like added, you have problems with the scraper, or you would like to contribute, please raise an issue at the GitHub repository.