Error while downloading Ensembl data for hgfemale_gene_ensembl #193

Open
gaurav opened this issue Oct 17, 2023 · 0 comments
gaurav commented Oct 17, 2023

I get an Ensembl error when trying to download the hgfemale_gene_ensembl dataset.

mmurdjan_gene_ensembl
Querying mmurdjan_gene_ensembl for attributes {'source', 'zfin_id_id', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
cldingo_gene_ensembl
Querying cldingo_gene_ensembl for attributes {'source', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
mgallopavo_gene_ensembl
Querying mgallopavo_gene_ensembl for attributes {'source', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'external_synonym', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
hgfemale_gene_ensembl
Querying hgfemale_gene_ensembl for attributes {'source', 'zfin_id_id', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'external_synonym', 'mgi_id', 'chromosome_name', 'external_gene_source', 'entrezgene_id', 'sgd_gene'}.
RuleException:
_BiomartException in file /Users/gaurav/Development/translator/babel/src/snakefiles/datacollect.snakefile, line 190:
Query ERROR: caught BioMart::Exception::Usage: Too many attributes selected for External References
  File "/Users/gaurav/Development/translator/babel/src/snakefiles/datacollect.snakefile", line 190, in __rule_get_ensembl
  File "/Users/gaurav/Development/translator/babel/src/datahandlers/ensembl.py", line 30, in pull_ensembl
  File "/Users/gaurav/Development/translator/babel/venv/lib/python3.11/site-packages/apybiomart/apybiomart.py", line 84, in query
  File "/Users/gaurav/Development/translator/babel/venv/lib/python3.11/site-packages/apybiomart/classes.py", line 386, in query
  File "/Users/gaurav/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run

As far as I can tell, this happens because this is the one dataset that contains five ID fields; an old post on Bioconductor suggests that BioMart's limit for External References might be three attributes per query.

In recent Babel runs I've just skipped this dataset, on the assumption that naked mole-rats are not very relevant to Translator. A more complete solution would be to rewrite the Ensembl retrieval code so that, when a dataset has too many attributes to request in a single query, we split them into smaller groups, query each group separately, and then reassemble the final dataset.

def pull_ensembl(complete_file):
    f = find_datasets()
    cols = {"ensembl_gene_id", "ensembl_peptide_id", "description", "external_gene_name", "external_gene_source",
            "external_synonym", "chromosome_name", "source", "gene_biotype", "entrezgene_id", "zfin_id_id", 'mgi_id',
            'rgd_id', 'flybase_gene_id', 'sgd_gene', 'wormbase_gene'}
    for ds in f['Dataset_ID']:
        print(ds)
        outfile = make_local_name('BioMart.tsv', subpath=f'ENSEMBL/{ds}')
        # Really, we should let snakemake handle this, but then we would need to put a list of all the 200+ sets in our
        # config, and keep it up to date. Maybe you could have a job that gets the datasets and writes a dataset file,
        # but then updates the config? That sounds bogus.
        if os.path.exists(outfile):
            continue
        atts = find_attributes(ds)
        existingatts = set(atts['Attribute_ID'].to_list())
        attsIcanGet = cols.intersection(existingatts)
        df = query(attributes=list(attsIcanGet), filters={}, dataset=ds)
        df.to_csv(outfile, index=False, sep='\t')
    with open(complete_file, 'w') as outf:
        outf.write(f'Downloaded gene sets for {len(f)} data sets.')
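
The splitting approach proposed above could look roughly like the sketch below. It is not tested against BioMart. The helper names `chunk_attributes` and `query_in_chunks`, the `key` parameter, and the default `chunk_size=2` are all hypothetical. The sketch assumes that each query returns the key attribute as its first column, under the same column name in every chunk, and that joining the chunks on that column is an acceptable way to reassemble the dataset. Since it isn't confirmed which attributes count towards the External References limit, it conservatively splits every non-key attribute into small groups.

```python
from functools import reduce


def chunk_attributes(attributes, key="ensembl_gene_id", chunk_size=2):
    """Split attributes into small groups, each starting with the key attribute."""
    # Sort so that the grouping is stable from one run to the next.
    others = sorted(a for a in attributes if a != key)
    groups = [[key] + others[i:i + chunk_size]
              for i in range(0, len(others), chunk_size)]
    # If only the key was requested, still issue a single query for it.
    return groups or [[key]]


def query_in_chunks(query_fn, dataset, attributes, key="ensembl_gene_id", chunk_size=2):
    """Run one query per attribute group and outer-join the results on the key column."""
    frames = []
    for group in chunk_attributes(attributes, key, chunk_size):
        # query_fn is expected to behave like the query() call in pull_ensembl.
        df = query_fn(attributes=group, filters={}, dataset=dataset)
        frames.append(df.drop_duplicates())
    # Assumption: the key attribute comes back as the first column of every chunk.
    key_col = frames[0].columns[0]
    return reduce(lambda left, right: left.merge(right, on=key_col, how="outer"), frames)
```

In `pull_ensembl`, the direct call could then become `df = query_in_chunks(query, ds, attsIcanGet)`. One caveat: when a gene has several values for attributes in different chunks (for example, several peptide IDs and several synonyms), the join yields every combination of those values, so the row count may differ from what a single BioMart query would return.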
