A basic Python-based EGA download client


EGA python client - pyEGA3
pyEGA3 uses the EGA REST API to download authorized datasets and files

Currently works only with Python 3.

REQUIREMENTS:
Python "requests" module
pip3 install requests

Firewall Ports
This client makes HTTPS calls to the EGA AAI and to the EGA Data API. Ports 8443 and 8051 must both be reachable from the machine where this client is run; otherwise requests will time out.
To verify, open a test connection to each port and confirm that it does not time out.
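
The reachability check described above can be sketched in Python as follows. This is a minimal sketch, not part of pyEGA3: the function names port_reachable and check_ports are hypothetical, and the host name is deliberately left as a parameter because it must be the EGA host you actually connect to.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=(8443, 8051), timeout=5.0):
    """Map each port to a bool saying whether it could be reached.

    The default ports are the two named in this README; `host` is an
    assumption supplied by the caller.
    """
    return {port: port_reachable(host, port, timeout) for port in ports}
```

Calling check_ports with the EGA host name returns a dictionary keyed by port; a False value for either port suggests a firewall is blocking it.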

INSTALLATION via Pip:
sudo pip3 install pyega3

INSTALLATION via Conda (Bioconda channel):
conda config --add channels bioconda
conda config --add channels conda-forge
conda install pyega3

usage: pyega3 [-h] [-d] -cf CREDENTIALS_FILE [-c CONNECTIONS] {datasets,files,fetch} ...

Download from EMBL EBI's EGA (European Genome-phenome Archive)

positional arguments:
    datasets            List authorized datasets
    files               List files in a specified dataset
    fetch               Fetch a dataset or file

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           Extra debugging messages
  -cf CREDENTIALS_FILE
                        JSON file containing credentials
  -c CONNECTIONS
                        Download using specified number of connections

The credentials file must be in JSON format, e.g.:
{
    "username": "",
    "password": "mypassword"
}

Your username and password are provided to you by EGA.
Specifying the password is optional: if it is not provided,
you will be asked to enter it at the console.
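
The behaviour described above (read the JSON file, prompt only when the password is missing) can be sketched as follows. This is an illustration under stated assumptions rather than pyEGA3's own code: load_credentials and its prompt parameter are hypothetical names, and only the "username" and "password" keys shown in this README are read.

```python
import getpass
import json


def load_credentials(path, prompt=getpass.getpass):
    """Return (username, password) read from a JSON credentials file.

    If the "password" key is absent or empty, the password is requested
    through `prompt`, which defaults to a console prompt without echo.
    """
    with open(path, encoding="utf-8") as fh:
        creds = json.load(fh)

    username = creds["username"]
    password = creds.get("password")
    if not password:
        # Hypothetical prompt text; the real client may word it differently.
        password = prompt("Password for '{}': ".format(username))
    return username, password
```

Passing a custom prompt function makes the fallback easy to exercise without a terminal, for example in automated checks.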


Parallelism (download via multiple connections) works at the file level,
but it can still be used when downloading a whole dataset.
If the -c command line switch is provided, every big file (>100Mb) in the
dataset is downloaded using the specified number of connections.

Each file download is broken into segments according to the number of
connections, and the segments are then downloaded in parallel. A very high
number therefore introduces overhead that slows down the download of the file.
Files are still downloaded in sequence: when an entire dataset is being
downloaded, multiple connections do not mean that multiple files are
downloaded in parallel.
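
The segmentation idea above can be illustrated with a short sketch. It is not pyEGA3's implementation: split_into_segments and plan_download are hypothetical names, and reading "100Mb" as 100 * 1024 * 1024 bytes is an assumption.

```python
# ASSUMPTION: "100Mb" in this README is interpreted as 100 MiB.
PARALLEL_THRESHOLD = 100 * 1024 * 1024


def split_into_segments(file_size, connections):
    """Split [0, file_size) into at most `connections` (start, length) segments."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    if connections < 1:
        raise ValueError("connections must be at least 1")
    if file_size == 0:
        return []
    segment = -(-file_size // connections)  # ceiling division
    return [
        (start, min(segment, file_size - start))
        for start in range(0, file_size, segment)
    ]


def plan_download(file_size, connections):
    """Use several segments only for files larger than the threshold."""
    if file_size <= PARALLEL_THRESHOLD:
        return split_into_segments(file_size, 1)
    return split_into_segments(file_size, connections)
```

For example, split_into_segments(10, 3) yields [(0, 4), (4, 4), (8, 2)]. Each segment costs one extra request, which is why a very large number of connections adds overhead without making the download faster.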


GENOMIC RANGE REQUESTS (via Htsget protocol):

usage: pyega3 fetch [-h] [--reference-name REFERENCE_NAME]
                    [--reference-md5 REFERENCE_MD5] [--start START]
                    [--end END] [--format {BAM,CRAM}]
                    [--max-retries MAX_RETRIES] [--retry-wait RETRY_WAIT]
                    [--saveto [SAVETO]]
                    identifier

positional arguments:
  identifier            Id for dataset (e.g. EGAD00000000001) or file

optional arguments:
  -h, --help            show this help message and exit
  --reference-name REFERENCE_NAME, -r REFERENCE_NAME
                        The reference sequence name, for example 'chr1', '1',
                        or 'chrX'. If unspecified, all data is returned.
  --reference-md5 REFERENCE_MD5, -m REFERENCE_MD5
                        The MD5 checksum uniquely representing the requested
                        reference sequence as a lower-case hexadecimal string,
                        calculated as the MD5 of the upper-case sequence
                        excluding all whitespace characters.
  --start START, -s START
                        The start position of the range on the reference,
                        0-based, inclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --end END, -e END     The end position of the range on the reference,
                        0-based exclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --format {BAM,CRAM}, -f {BAM,CRAM}
                        The format of data to request.
  --max-retries MAX_RETRIES, -M MAX_RETRIES
                        The maximum number of times to retry a failed
                        transfer. Any negative number means infinite number of
                        retries( default value = 5 ).
  --retry-wait RETRY_WAIT, -W RETRY_WAIT
                        The number of seconds to wait before retrying a failed
                        transfer( default value = 5 ).
  --saveto [SAVETO]     Output file(for files)/output dir(for datasets)
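
Because --start is 0-based inclusive and --end is 0-based exclusive, a region written in the common 1-based inclusive form (name:first-last) needs converting before it is passed to pyega3 fetch. The sketch below shows one way to do that and to assemble the command line. region_to_range and fetch_command are hypothetical helpers, not part of pyEGA3, and the region notation itself is an assumption about how the coordinates are written down.

```python
import re

REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def region_to_range(region):
    """Convert 'name:first-last' (1-based, inclusive) to (name, start, end).

    The result is 0-based with an inclusive start and an exclusive end,
    matching the --start/--end descriptions above.
    """
    match = REGION_RE.match(region.replace(",", ""))
    if match is None:
        raise ValueError("expected 'name:first-last', got {!r}".format(region))
    name, first, last = match.group(1), int(match.group(2)), int(match.group(3))
    if first < 1 or last < first:
        raise ValueError("invalid coordinates in {!r}".format(region))
    return name, first - 1, last


def fetch_command(credentials_file, identifier, region, fmt="BAM", saveto=None):
    """Build an argument list for 'pyega3 fetch' using only documented flags."""
    if fmt not in ("BAM", "CRAM"):
        raise ValueError("format must be BAM or CRAM")
    name, start, end = region_to_range(region)
    cmd = [
        "pyega3", "-cf", credentials_file, "fetch", identifier,
        "--reference-name", name,
        "--start", str(start),
        "--end", str(end),
        "--format", fmt,
    ]
    if saveto is not None:
        cmd += ["--saveto", saveto]
    return cmd
```

The returned list can be handed to subprocess.run. The identifier is placed directly after the subcommand so that the optional value of --saveto cannot absorb it.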
