Issue 111 + 112 uniprot #115

Merged
merged 205 commits into from
May 23, 2023

HobnobMancer
Owner

Migrate from bioservices.UniProt().get_df() to bioservices.UniProt().mapping():

  • minimise crashing from bioservices get_df method
  • faster: increased batch sizes
  • does not require querying NCBI

Retry failed batches, and parse IDs individually when they may be unlisted in NCBI.
Remove unused sections of code, and use the new code in ncbi.tax.lineage.
Do not rely on entrez.elink to retrieve the correct protein record ID for a given protein accession. Batch fetch protein records to pair the protein ID and accession correctly.
Skip downloading the latest taxonomic information from NCBI where multiple taxonomic classifications are retrieved from CAZy for a protein. The first taxonomy retrieved from CAZy will be added to the local CAZyme database.
No longer retrieve taxonomic data from NCBI by default; use the first taxon listed in the CAZy db dump instead. If the new flag is used, retrieve the latest taxonomic classifications from NCBI for proteins listed with multiple taxa in CAZy.
Retrieve taxonomic classifications from UniProt and add them to the local db.
Associate genus and species.
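
The batch-and-retry behaviour described above could look roughly like the sketch below. It is not the code from this PR: `map_in_batches`, `chunked`, and the `mapper` callable are hypothetical names, and `mapper` stands in for whatever wraps the bioservices mapping call (assumed here to take a list of accessions and return a dict of accession to UniProt ID).

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

# Hypothetical type: a callable that maps a batch of protein accessions
# to UniProt IDs and raises an exception if the whole batch fails.
Mapper = Callable[[List[str]], Dict[str, str]]


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def map_in_batches(
    accessions: Sequence[str],
    mapper: Mapper,
    batch_size: int = 100,
) -> Tuple[Dict[str, str], List[str]]:
    """Map accessions in batches, retrying a failed batch one ID at a time.

    Returns the successful mappings and the accessions that still failed
    when queried individually.
    """
    mapped: Dict[str, str] = {}
    failed: List[str] = []

    for batch in chunked(accessions, batch_size):
        try:
            mapped.update(mapper(batch))
        except Exception:
            # The batch failed as a whole: retry each accession on its own
            # so that one bad ID does not discard the rest of the batch.
            for accession in batch:
                try:
                    mapped.update(mapper([accession]))
                except Exception:
                    failed.append(accession)

    return mapped, failed
```

Passing the mapper in as a callable keeps the batching logic independent of the bioservices version in use, which is an assumption of this sketch rather than a description of how cazy_webscraper is structured.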

codecov bot commented May 22, 2023

Codecov Report

Merging #115 (b902dea) into master (a5da4ea) will decrease coverage by 3.13%.
The diff coverage is 24.97%.

@@            Coverage Diff             @@
##           master     #115      +/-   ##
==========================================
- Coverage   56.22%   53.10%   -3.13%     
==========================================
  Files          61       69       +8     
  Lines        5576     6011     +435     
==========================================
+ Hits         3135     3192      +57     
- Misses       2441     2819     +378     

@HobnobMancer
Owner Author

New in version 2.3.0

  • Downloading protein data from UniProt is several orders of magnitude faster than before, and should have fewer issues with older versions of bioservices

    • Uses bioservices mapping to map directly from NCBI protein version accession to UniProt
    • cw_get_uniprot_data no longer calls NCBI and thus no longer requires an email address as a positional argument
  • Updated database schema: Changed Genbanks 1--* Uniprots to Genbanks *--1 Uniprots. Uniprots.uniprot_id is now listed in the Genbanks table, instead of listing Genbanks.genbank_id in the Uniprots table

  • Retrieve taxonomic classifications from UniProt

    • Use the --taxonomy/-t flag to retrieve the scientific name (genus and species) for proteins of interest
    • Adds downloaded taxonomic information to the UniprotsTaxs table
  • Clearer handling of deleting old records when using cw_get_uniprot_data

    • Separate arguments delete the Genbanks-EC number and Genbanks-PDB accession relationships that are no longer listed in UniProt, for those proteins in the local CAZyme database whose data is downloaded from UniProt
    • New args:
      • --delete_old_ec_relationships = deletes Genbank(protein)-EC number relationships no longer in UniProt
      • --delete_old_ecs = deletes EC numbers in the local db not linked to any proteins
      • --delete_old_pdb_relationships = deletes Genbank(protein)-PDB relationships no longer in UniProt
      • --delete_old_pdbs = deletes PDB accessions in the local db not linked to any proteins
  • Retrieve the local db schema

    • New command cw_get_db_schema added.
    • Retrieves the SQLite schema of a local CAZyme database and prints it to the terminal
  • Added option to skip retrieving the latest taxonomic classifications from NCBI

    • By default, when retrieving data from CAZy, cazy_webscraper retrieves the latest taxonomic classifications for proteins listed under multiple taxa
    • To reduce scraping time, and to reduce the burden on the NCBI-Entrez server, this step can be skipped with the new --skip_ncbi_tax flag if this data is not needed (e.g. GTDB taxonomies will be used)
    • When skipping retrieval of the latest taxonomic classifications from NCBI, cazy_webscraper will add the first taxon retrieved from CAZy for those proteins listed under multiple taxa
  • Update documentation

    • README
    • Read the docs
    • Updated db schema
    • Added db schema to README
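
To illustrate the schema change and the schema retrieval described above, here is a minimal sketch using Python's built-in sqlite3 module. It is not the cazy_webscraper schema or the cw_get_db_schema implementation: only the Genbanks and Uniprots table names and the genbank_id and uniprot_id columns come from the notes above, while the accession columns and the function names are assumptions made for the example.

```python
import sqlite3

# Illustrative schema only. Many Genbanks rows may point at one Uniprots row
# (Genbanks *--1 Uniprots), so the foreign key lives in the Genbanks table.
# The *_accession columns are assumed names, not taken from the real schema.
EXAMPLE_SCHEMA = """
CREATE TABLE Uniprots (
    uniprot_id INTEGER PRIMARY KEY,
    uniprot_accession TEXT UNIQUE
);
CREATE TABLE Genbanks (
    genbank_id INTEGER PRIMARY KEY,
    genbank_accession TEXT UNIQUE,
    uniprot_id INTEGER REFERENCES Uniprots (uniprot_id)
);
"""


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database holding the illustrative schema."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(EXAMPLE_SCHEMA)
    return connection


def get_db_schema(connection: sqlite3.Connection) -> str:
    """Return the CREATE statements SQLite stores for the database."""
    rows = connection.execute(
        # Automatically created indexes have a NULL sql column, so skip them.
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


if __name__ == "__main__":
    print(get_db_schema(build_example_db()))
```

Reading the stored statements from sqlite_master is one standard way to view a SQLite schema; whether the new command uses this approach is not stated here.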

@HobnobMancer HobnobMancer merged commit b720c81 into master May 23, 2023
1 of 3 checks passed
@HobnobMancer HobnobMancer deleted the issue_112_uniprot branch June 21, 2023 10:32
Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • unit tests: Add/update unit tests