
sratools driver tool

skripche edited this page Dec 15, 2020 · 2 revisions

The sratools driver tool

With the move of SRA data to various cloud platforms, there is now more than one source for the data. Depending on where the user is located, some sources may be fast and cheap while others may be slow or costly. Sources are not required to provide exactly the same data; for example, some sources might not provide original spot names and quality scores. The result is a complicated and arbitrary matrix of choices, and the sra-toolkit is changing in response.

Rather than change all the tools, we have created a single tool to deal with these changes and to interact with the user. This new tool determines the proper objects to satisfy users' requests. It drives the worker tools with the correct URLs for the runs they are to work on.

For dbGaP users: if you access the data from the same cloud in which it is stored, you will not need decryption to access your permitted data sets.

Using the sratools driver tool

The sratools driver tool is designed to work transparently. If you want to run fastq-dump, you still type fastq-dump, but what actually runs is sratools acting as fastq-dump. After sratools examines the command line, it runs the original fastq-dump with the information that tool needs to do its work.

Supported cloud platforms

The SRA currently supports Amazon's EC2 and Google's GCP platforms. These are the platforms on which we have copies of the SRA. This list is open-ended. Additional cloud providers and/or regions may be added in the future.

Configuration is required.

You must run configuration at least once; if sratools cannot find your configuration, it will print instructions and quit. If you are running in a cloud environment, you will need to configure your cloud settings. If you also want access to data that is located in the same cloud, you will need to allow the toolkit to send your cloud identity token to NCBI.

New command line options

There are some new command line options that all tools get and that are handled by sratools itself. These are related to cloud location and permissions.

  • --ngc <file> Needed in order to read encrypted dbGaP data that is stored at NCBI. NB. this is mutually exclusive with --perm.
  • --perm <file> Needed in order to access protected data that is stored in the cloud. NB. this is mutually exclusive with --ngc.
  • --location <string> Needed in order to access data that is stored in a different cloud or region, e.g. 's3.us-east-1', 'gs.us'. This is a hint. If the data doesn't exist at the requested location, you will get a location at which the data does exist. NB. Accessing data in a different cloud/region may incur additional costs to you.
  • --cart <file> Needed in order to use a cart file you may have downloaded from dbGaP.

Additionally, sratools can handle multiple accessions at once; even if the underlying tool does not support this, sratools makes it work.
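
The mutual exclusion of --ngc and --perm can be checked in a wrapper script before the toolkit is ever called. The following is a minimal sketch of such a check, not sratools' own code; the function name check_auth_options and the file names in the usage note are hypothetical, and only the space-separated `--option <file>` form shown on this page is recognized.

```shell
# Hypothetical pre-flight check for a wrapper script.
# Returns 0 if the parameters are acceptable, 1 if both --ngc and --perm appear.
check_auth_options() {
    have_ngc=0
    have_perm=0
    for p in "$@"; do
        case "$p" in
            --ngc)  have_ngc=1 ;;
            --perm) have_perm=1 ;;
        esac
    done
    if [ "$have_ngc" -eq 1 ] && [ "$have_perm" -eq 1 ]; then
        echo "error: --ngc and --perm are mutually exclusive" >&2
        return 1
    fi
    return 0
}
```

For example, `check_auth_options --ngc my_project.ngc SRR000001 SRR000002` succeeds, while adding `--perm my_permissions.file` to the same parameter list makes it fail with the error message.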

Parameter transformations

If a parameter is not an option and is not an argument to an option, sratools treats it as a potential SRA accession and requests information about it from NCBI. This replaces the old (pre-2.10) name resolution process. Some options may be removed before the command line is passed on to the tool, particularly options that sratools processes itself.
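
The classification rule above can be illustrated with a small shell function. This is a sketch under stated assumptions, not the logic sratools actually uses: the function name classify_params is hypothetical, and only the four new options listed on this page are treated as taking an argument, whereas the real tools define many more.

```shell
# Hypothetical sketch: label each parameter as an option, an option's
# argument, or a potential accession.
# Assumption: only --ngc, --perm, --location and --cart take an argument.
classify_params() {
    expect_arg=0
    for p in "$@"; do
        if [ "$expect_arg" -eq 1 ]; then
            # The previous parameter was an option that takes an argument.
            printf 'argument %s\n' "$p"
            expect_arg=0
            continue
        fi
        case "$p" in
            --ngc|--perm|--location|--cart)
                printf 'option %s\n' "$p"
                expect_arg=1
                ;;
            -*)
                printf 'option %s\n' "$p"
                ;;
            *)
                # Neither an option nor an option's argument.
                printf 'accession %s\n' "$p"
                ;;
        esac
    done
}

classify_params --ngc my_project.ngc --split-files SRR000001 SRR000002
```

In this sketch, `my_project.ngc` is labeled as an argument, and the last two parameters are labeled as accessions, which are the ones that would be sent for name resolution.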

The name resolution process

Since data can now be stored in multiple locations, the new name resolution process aims to locate the copy that is closest to the user. For users running from a cloud location, this means resolving to data that is stored in the same cloud and region. For users not running in a supported cloud and region, this means resolving to data that is stored at NCBI, as before.
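
The preference described above can be sketched as a simple selection rule. This is only an illustration: the real resolution is performed by NCBI's services, the function name pick_source is hypothetical, the label "ncbi" is a placeholder for data stored at NCBI, and the sketch assumes a copy is always available there.

```shell
# Hypothetical sketch: pick_source USER_LOCATION AVAILABLE_LOCATION...
# Prints the user's own location if the data is stored there,
# otherwise falls back to the placeholder label "ncbi".
pick_source() {
    wanted=$1
    shift
    for candidate in "$@"; do
        if [ "$candidate" = "$wanted" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "ncbi"
}

# A user in s3.us-east-1, with copies stored in two clouds.
pick_source s3.us-east-1 gs.us s3.us-east-1
# A user outside any supported cloud (empty location).
pick_source "" gs.us s3.us-east-1
```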

If you have permission to access protected data (e.g. data from dbGaP), you are running in a supported cloud, and the data is in the same cloud, name resolution will give you direct access and no decryption will be needed. Otherwise, decryption is still needed, and you will need an NGC file from dbGaP to decrypt the data. The toolkit team continues to work on ways to make this easier while still safeguarding the data, so this is subject to change.

The technical details

Many of the tools in the bin directory have been replaced by symlinks to sratools, with the originals having been renamed to *-orig.
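
The symlink arrangement can be reproduced in miniature to see how one program can act under several names. The sketch below builds a toy layout in a scratch directory; the two scripts are stand-ins written for this illustration and do not reflect the real sratools or fastq-dump code.

```shell
# Toy reproduction of the layout in a scratch directory.
demo_dir=$(mktemp -d)

# Stand-in for the original worker tool, renamed with the -orig suffix.
cat > "$demo_dir/fastq-dump-orig" <<'EOF'
#!/bin/sh
echo "worker fastq-dump-orig received: $*"
EOF

# Stand-in for the driver: it checks the name it was invoked under and
# runs the matching *-orig program as a sub-process.
cat > "$demo_dir/driver" <<'EOF'
#!/bin/sh
invoked_as=$(basename "$0")
echo "driver running as: $invoked_as"
"$(dirname "$0")/${invoked_as}-orig" "$@"
EOF

chmod +x "$demo_dir/fastq-dump-orig" "$demo_dir/driver"

# The user-facing name is just a symlink to the driver.
ln -s driver "$demo_dir/fastq-dump"

demo_output=$("$demo_dir/fastq-dump" SRR000001)
echo "$demo_output"
rm -rf "$demo_dir"
```

Running the symlink prints one line from the toy driver, reporting that it is running as fastq-dump, followed by one line from the toy worker showing the parameter it received.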

sratools runs the requested tool as a sub-process. For each command line parameter that is not an option or an option's argument, it:

  1. performs name resolution.
  2. sets environment variables with any additional information from name resolution.
  3. runs the requested tool with the appropriate options.
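
The three steps can be sketched as a loop. Everything specific here is an assumption made for illustration: resolve_accession is a stub that returns a made-up URL instead of querying NCBI, EXAMPLE_RUN_URL is a hypothetical variable name (this page does not list the variables sratools sets), and the worker is a one-line `sh -c` stand-in. For brevity the sketch skips anything starting with `-` and does not handle option arguments.

```shell
# Stub resolver: a real driver would ask NCBI; this returns a made-up URL.
resolve_accession() {
    echo "https://example.invalid/runs/$1"
}

# Hypothetical driver loop over the command line parameters.
drive() {
    for param in "$@"; do
        case "$param" in
            -*) continue ;;   # options are not resolved
        esac
        # 1. name resolution
        url=$(resolve_accession "$param")
        # 2. pass the result through the environment, and
        # 3. run the worker as a sub-process (here a stand-in).
        EXAMPLE_RUN_URL="$url" sh -c 'echo "worker got $1 via $EXAMPLE_RUN_URL"' worker "$param"
    done
}

drive --split-files SRR000001 SRR000002
```

Because the worker is started once per accession, this pattern also shows how a driver can accept several accessions even when the underlying tool handles only one at a time.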

Running the original tools directly is neither recommended nor supported; it may work, or it may fail. The purpose of sratools is to handle this for you. Please allow it to do its job. If sratools is not working for you, it is probably a bug, and we would like the opportunity to fix it.

Very technical details

Environment variables

sratools pays attention to some environment variables, which may be helpful in certain situations.

  • SRATOOLS_VERBOSE setting this equal to a number between 1 and 9 will cause sratools to print verbose messages. NB. sratools does not process --verbose itself; that option is passed on to the child processes. SRATOOLS_VERBOSE is how to set the verbosity level for sratools itself.
  • SRATOOLS_DRY_RUN setting this equal to 1 is equivalent to setting SRATOOLS_TESTING=3.
  • SRATOOLS_TESTING setting this to one of the following values enables the corresponding test mode:
      1. Runs internal tests and quits. There is no output on success.
      2. Skips name resolution and tersely prints the commands it would have issued.
      3. Does name resolution and verbosely prints the commands it would have issued, along with any environment variables it would have set.
      4. Does the same as 3 but prints in a format that should be directly executable if put into a shell script.
  • SRATOOLS_IMPERSONATE setting this will cause sratools to run as if argv[0] == $SRATOOLS_IMPERSONATE.
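
These variables can be set for a single invocation by prefixing the command, which avoids leaving them exported in your shell. The helpers below are a minimal sketch; the function names sratools_preview and sratools_verbose are hypothetical and not part of the toolkit, and verbosity level 3 is an arbitrary choice within the documented range.

```shell
# Hypothetical helper: run a command in test mode 2, which skips name
# resolution and tersely prints the commands that would have been issued.
sratools_preview() {
    SRATOOLS_TESTING=2 "$@"
}

# Hypothetical helper: run a command with sratools' own verbosity raised.
# Any --verbose on the command line still goes to the child tool.
sratools_verbose() {
    SRATOOLS_VERBOSE=3 "$@"
}
```

With these defined, `sratools_preview fastq-dump SRR000001` would show what sratools intends to run without running it, and `sratools_verbose fastq-dump SRR000001` would run it with extra messages from sratools itself.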