Error while downloading Ensembl data for hgfemale_gene_ensembl #193

gaurav · 2023-10-17T02:33:37Z

I get an Ensembl error when trying to download the hgfemale_gene_ensembl dataset.

mmurdjan_gene_ensembl
Querying mmurdjan_gene_ensembl for attributes {'source', 'zfin_id_id', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
cldingo_gene_ensembl
Querying cldingo_gene_ensembl for attributes {'source', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
mgallopavo_gene_ensembl
Querying mgallopavo_gene_ensembl for attributes {'source', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'external_synonym', 'entrezgene_id', 'chromosome_name', 'external_gene_source'}.
hgfemale_gene_ensembl
Querying hgfemale_gene_ensembl for attributes {'source', 'zfin_id_id', 'ensembl_peptide_id', 'external_gene_name', 'description', 'gene_biotype', 'ensembl_gene_id', 'external_synonym', 'mgi_id', 'chromosome_name', 'external_gene_source', 'entrezgene_id', 'sgd_gene'}.
RuleException:
_BiomartException in file /Users/gaurav/Development/translator/babel/src/snakefiles/datacollect.snakefile, line 190:
Query ERROR: caught BioMart::Exception::Usage: Too many attributes selected for External References
  File "/Users/gaurav/Development/translator/babel/src/snakefiles/datacollect.snakefile", line 190, in __rule_get_ensembl
  File "/Users/gaurav/Development/translator/babel/src/datahandlers/ensembl.py", line 30, in pull_ensembl
  File "/Users/gaurav/Development/translator/babel/venv/lib/python3.11/site-packages/apybiomart/apybiomart.py", line 84, in query
  File "/Users/gaurav/Development/translator/babel/venv/lib/python3.11/site-packages/apybiomart/classes.py", line 386, in query
  File "/Users/gaurav/.pyenv/versions/3.11.2/lib/python3.11/concurrent/futures/thread.py", line 58, in run

As far as I can tell, this happens because this is the only dataset that exposes five ID fields; an old post on Bioconductor suggests that BioMart may limit a single query to three External References attributes.

In recent Babel runs I've just skipped this dataset on the assumption that naked mole-rats are not very relevant to Translator. A more complete solution would be to change the Ensembl retrieval code so that, when a dataset has too many attributes to request in one query, we split them into smaller groups, query each group separately, and then reassemble the final dataset.

Babel/src/datahandlers/ensembl.py

Lines 13 to 32 in 8647162

    
    def pull_ensembl(complete_file):
        f = find_datasets()
        cols = {"ensembl_gene_id", "ensembl_peptide_id", "description", "external_gene_name", "external_gene_source",
                "external_synonym", "chromosome_name", "source", "gene_biotype", "entrezgene_id", "zfin_id_id", 'mgi_id',
                'rgd_id', 'flybase_gene_id', 'sgd_gene', 'wormbase_gene'}
        for ds in f['Dataset_ID']:
            print(ds)
            outfile = make_local_name('BioMart.tsv', subpath=f'ENSEMBL/{ds}')
            # Really, we should let snakemake handle this, but then we would need to put a list of all the 200+ sets in our
            # config, and keep it up to date.  Maybe you could have a job that gets the datasets and writes a dataset file,
            # but then updates the config? That sounds bogus.
            if os.path.exists(outfile):
                continue
            atts = find_attributes(ds)
            existingatts = set(atts['Attribute_ID'].to_list())
            attsIcanGet = cols.intersection(existingatts)
            df = query(attributes=list(attsIcanGet), filters={}, dataset=ds)
            df.to_csv(outfile, index=False, sep='\t')
        with open(complete_file, 'w') as outf:
            outf.write(f'Downloaded gene sets for {len(f)} data sets.')
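
A rough sketch of the split-and-reassemble idea described above, not the fix that was actually merged. It assumes the query function accepts the same `attributes`, `filters` and `dataset` keyword arguments used in `pull_ensembl`, and that the returned DataFrame columns are named by attribute ID (if they come back as display names they would need to be renamed first). The helper names `chunked` and `query_in_batches`, the `xref_attrs` argument and the default limit of three are all hypothetical; which attributes BioMart counts as External References is not established here, so the caller has to supply that set.

```python
from functools import reduce

import pandas as pd


def chunked(items, size):
    """Split items into sorted lists of at most `size` elements."""
    items = sorted(items)
    return [items[i:i + size] for i in range(0, len(items), size)]


def query_in_batches(run_query, dataset, core_attrs, xref_attrs, max_xrefs=3):
    """Query `dataset` several times, each with at most `max_xrefs` xref attributes.

    run_query  -- callable taking attributes=, filters=, dataset= (assumed to
                  behave like the `query` call in pull_ensembl)
    core_attrs -- attributes requested in every batch and used as merge keys
    xref_attrs -- attributes assumed to count against the BioMart limit
    max_xrefs  -- assumed per-query limit (three, per the Bioconductor post)
    """
    core = sorted(core_attrs)
    # Always run at least one query, even if there are no xref attributes.
    batches = chunked(xref_attrs, max_xrefs) or [[]]
    frames = [
        run_query(attributes=core + batch, filters={}, dataset=dataset).drop_duplicates()
        for batch in batches
    ]
    # Outer-join the partial results on the shared core columns.
    # Assumption: result columns are named by attribute ID.
    return reduce(lambda left, right: left.merge(right, on=core, how="outer"), frames)
```

In `pull_ensembl` this would replace the single `query(...)` call, for example `query_in_batches(query, ds, attsIcanGet - xrefs, attsIcanGet & xrefs)` with `xrefs` being whatever set of ID attributes turns out to trigger the error. One caveat: a gene with several values in more than one batch will produce every combination of those values after the merge, so the row count may differ from what a single query would have returned.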

gaurav added this to the Babel - not-urgent milestone Oct 17, 2023

gaurav mentioned this issue Oct 17, 2023 in "Fix for Ensembl downloads with too many attributes #194" (Draft)

gaurav added the Data Source: Ensembl label Oct 17, 2023

gaurav mentioned this issue Dec 2, 2023 in "Babel v1.3 ongoing fixes #201" (Merged)
