Requirements: having successfully run the steps in database creation.
Requirements2: check that the species are available in Ensembl or EnsemblMetazoa.
Goal: insert the species used in Bgee, the related taxonomy from NCBI, and sex information about species.
source_files/species/bgeeSpecies.tsv contains the species used in Bgee. This is the file to modify to add or remove a species.
If you add/remove some species, you need to also update the files pipeline/db_creation/insert_data_sources_to_species.sql and
source_files/species/bgeeSpecies.tsv file: the first line is a header line, and it must contain the following columns:
speciesId: The NCBI taxonomy species ID (e.g., 9606 for human).
genus: genus of the species.
species: species name of the species.
speciesCommonName: common name of the species.
genomeFilePath: path to the genome file of the species on the Ensembl FTP. If no genome is available for this species, this can point to the genome of a closely related species (e.g., use of the chimpanzee genome for bonobo).
genomeVersion: genome version used.
dataSourceId: ID of the data source providing the genome (currently, either Ensembl or EnsemblMetazoa).
genomeSpeciesId: the NCBI taxonomy ID of the species whose genome is being used. In most cases, it is the same as speciesId, but for some species with no genome available, it is possible to use the genome of a closely related species (e.g., use of the chimpanzee genome for bonobo).
fakeGeneIdPrefix: If the genome of another species is used, the prefixes of the Ensembl gene IDs of this other species will be replaced with the provided prefix.
keywords: a list of keywords/alternative names associated to this species, separated by the character
Note that the species names are obtained from this file, so it must not contain any errors.
If a line starts with #, it is commented out and the species will not be inserted.
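As a rough illustration of the file rules described above (required header columns, lines starting with # skipped), here is a minimal Python sketch. It is not part of the pipeline: the function name `read_species` and the placeholder values in the example are assumptions made for illustration, and it assumes the header is the first non-empty line and is not itself commented.

```python
# Column names as listed in this README; the check itself is only a sketch.
REQUIRED_COLUMNS = [
    "speciesId", "genus", "species", "speciesCommonName", "genomeFilePath",
    "genomeVersion", "dataSourceId", "genomeSpeciesId", "fakeGeneIdPrefix",
    "keywords",
]


def read_species(lines):
    """Return the non-commented rows of a bgeeSpecies.tsv-like file as dicts."""
    # Drop empty lines, keep tabs intact (only the trailing newline is removed).
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        raise ValueError("empty file: a header line is required")
    header = rows[0].split("\t")
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing columns in header: {missing}")
    species = []
    for row in rows[1:]:
        # A line starting with '#' is commented out: the species is not inserted.
        if row.startswith("#"):
            continue
        species.append(dict(zip(header, row.split("\t"))))
    return species


# Hypothetical content: every value except the human taxon ID is a placeholder.
example = [
    "\t".join(REQUIRED_COLUMNS) + "\n",
    "9606\tHomo\tsapiens\thuman\tPLACEHOLDER_PATH\tPLACEHOLDER_VERSION"
    "\tPLACEHOLDER_SOURCE\t9606\t\tPLACEHOLDER_KEYWORDS\n",
    "#0\tGenus\tspecies\tcommented out species\n",
]
species = read_species(example)
print([row["speciesId"] for row in species])
```

Running a check of this kind before the insertion step can catch a missing column early, since errors in this file propagate to the species names stored in the database.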
This pipeline step requires the NCBI taxonomy, provided as an ontology.
We cannot use the official taxonomy ontology because, as of Bgee 13, it does not include the last modifications that we requested to NCBI, and that were accepted (e.g., addition of a Dipnotetrapodomorpha term). Also, to correctly infer taxon constraints at later steps, we need this ontology to include disjoint classes axioms between sibling taxa, as explained in a Chris Mungall blog post. The default ontology does not include those.
This pipeline step is thus capable of generating its own version of the NCBI taxonomy ontology, in the exact same way as for the official ontology, as described on the OBOFoundry wiki (see notably the Makefile generating the ontology). It is based on files available from the NCBI FTP (ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/taxonomy.dat). The code to generate disjoint classes axioms is based on the code from the owltools Java class owltools.cli.TaxonCommandRunner, in the module
This custom taxonomy will include the species used in Bgee and their ancestors, the taxa used in our annotations and their ancestors, and the taxa used in Uberon and their ancestors. To extract taxa used in Uberon, we use the ext version (this is the one containing more taxa).
Note that the generation of the taxonomy requires about 15 GB of memory.
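To make the disjointness requirement mentioned above more concrete, here is a minimal Python sketch of the idea: for each parent taxon, every pair of its direct children is declared disjoint. This is only an illustration of the principle, not the owltools implementation; the function name and the taxon labels are hypothetical.

```python
from itertools import combinations


def sibling_disjoint_pairs(children_by_parent):
    """List the pairs of sibling taxa that would receive a disjointness axiom.

    children_by_parent maps a parent taxon to its direct children.
    """
    pairs = []
    for parent in sorted(children_by_parent):
        siblings = sorted(children_by_parent[parent])
        # Each unordered pair of siblings yields one disjointness statement.
        pairs.extend(combinations(siblings, 2))
    return pairs


# Hypothetical taxonomy fragment, labels are placeholders.
taxonomy = {
    "root": ["taxonB", "taxonA", "taxonC"],
    "taxonA": ["taxonA1"],
}
disjoint = sibling_disjoint_pairs(taxonomy)
print(disjoint)
```

A parent with a single child produces no pair, and the number of pairs grows quadratically with the number of siblings, which is consistent with the generated ontology being larger than the default one.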
This pipeline step will insert sex information about species, see
If it is the first time you execute this step in this pipeline run:
Modify the file pipeline/db_creation/update_data_sources.sql: you need to add the last modification date of the taxonomy used. This information can be found by looking at the file
generated_files/species/step_verification_RELEASE.txt should contain: the total number of species inserted; the total number of taxa inserted; the number of taxa inserted that represent the least common ancestor of at least two species used in Bgee; the complete list of species ordered by their ID; the complete list of least common ancestor taxa, ordered by their position in the taxonomy (root to leaf); the complete list of taxa ordered by their position in the taxonomy (root to leaf).
- Compare the species list between releases.
- Check that there is no species with missing sex information. Otherwise, you need to update the file
- The taxa should be displayed ordered from root to leaf, and taxa at the same level should be ordered alphabetically by scientific name. Verify that this ordering is correct, as it is important.
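The comparison of species lists between releases can be sketched as a simple set difference. The snippet below is a minimal Python illustration under the assumption that the species IDs of each release have already been extracted from the verification files; the function name and the example IDs are hypothetical, and the exact layout of the verification file is not assumed here.

```python
def compare_species(previous_ids, current_ids):
    """Report the species IDs added and removed between two releases."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


# Hypothetical species ID lists for two releases.
previous_release = [1, 2, 3]
current_release = [2, 3, 4, 5]
diff = compare_species(previous_release, current_release)
print(diff)
```

Any entry under "removed" that was not an intentional change to bgeeSpecies.tsv is worth investigating before moving on to the next step.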
The following files should have been generated:
generated_files/species/bgee_ncbitaxon.owl, our custom taxonomy ontology
generated_files/species/annotTaxIds.tsv, a TSV file containing the IDs of the taxa used in our annotations
generated_files/species/allTaxIds.tsv, a TSV file containing the IDs of the species used in Bgee, of the taxa used in our annotations, of the taxa used in Uberon.
An exception may be thrown saying that a specified taxon does not exist in the taxonomy ontology, for instance: java.lang.IllegalArgumentException: Taxon NCBITaxon:71164 was not found in the ontology. It likely means that an incorrect or deprecated taxon is used in Uberon. Remove the ID of the taxon from the file generated_files/species/allTaxIds.tsv (so if the exception is related to a taxon NCBITaxon_71164, remove the ID 71164). Check whether the taxon ID is present in the file generated_files/species/annotTaxIds.tsv: if it is, the error is on us. Otherwise, report the problem on the Uberon tracker, if you identified the taxon in Uberon. You will most likely need to manually modify the ontology to remove the offending taxa in the meantime.
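The manual fix described above (extract the numeric ID from the exception, check the annotation file, remove the ID from the list of all taxa) can be sketched in Python as follows. This is a minimal illustration, not pipeline code: the function names are hypothetical, and it assumes the taxon ID is stored as a bare number in the first tab-separated column of each TSV line.

```python
import re


def taxon_number(taxon):
    """Turn 'NCBITaxon:71164' or 'NCBITaxon_71164' into '71164'."""
    return re.split(r"[:_]", taxon)[-1]


def contains_taxon(lines, taxon_id):
    """True if a line has taxon_id as its first tab-separated field (assumed layout)."""
    return any(line.rstrip("\n").split("\t")[0] == taxon_id for line in lines)


def remove_taxon(lines, taxon_id):
    """Return the lines without those whose first field equals taxon_id."""
    return [line for line in lines if line.rstrip("\n").split("\t")[0] != taxon_id]


# Hypothetical file contents; only 71164 comes from the example exception.
all_tax_ids = ["9606\n", "71164\n", "7116\n"]
annot_tax_ids = ["9606\n"]

offending = taxon_number("NCBITaxon:71164")
in_annotations = contains_taxon(annot_tax_ids, offending)
cleaned = remove_taxon(all_tax_ids, offending)
print(offending, in_annotations, cleaned)
```

Matching on the whole first field, rather than on a substring, avoids accidentally removing another ID that merely contains the same digits (7116 in the example).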
If you need to re-insert the data in the database, use make deleteSpeciesAndTaxa, then
Other notable Makefile targets
- generate the TSV files containing the taxa used in our annotations:
- generate the TSV files containing all taxa used in Bgee, in our annotations, in Uberon
- generate the taxonomy ontology:
- remove the species and taxa from the database (this is not done when calling clean, to avoid wiping the database by accident)