# Data accession example commands for ENA AMR database

---

This notebook provides a few examples that are useful for accessing data in the ENA database. It is written in Python.

---
_Author: Bálint Ármin Pataki_

---

First we need a username and a password. For the AMR project the username is **dcc_schubert**; the password is private.

In [1]:
user = 'dcc_schubert'
pwd  = '************'

We will use **pandas** for handling the datafiles.

In [2]:
import pandas as pd

Using the [ENA Portal API](https://www.ebi.ac.uk/ena/portal/api/) we can generate links that match our needs.  
The link leads to the complete documentation; this notebook only offers a quick start with copy-paste commands.

A few notes on the commands below:
 - **\$user:\$pwd** provides the authentication.
 - **result=analysis** or **result=read_run** selects the type of result we want.   
 The first returns the analysis-related files (here, the antibiotic resistance files); the second returns the raw read files.
 - **query=analysis_type%3D%22AMR_ANTIBIOGRAM%22** restricts the results to analyses of type AMR_ANTIBIOGRAM (this is the URL-encoded form of `analysis_type="AMR_ANTIBIOGRAM"`).
 - **dataPortal=pathogen** selects the pathogen data portal.
 - **fields= ...** lists the columns we are interested in.
 - **dccDataOnly=true** limits the results to data linked to the dcc_schubert data hub.
 
The results are saved to **metaFILE.txt** and **readFILE.txt**.
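
The percent-encoded query strings do not have to be typed by hand. The following is a minimal sketch using only the standard library, assuming the same endpoint and parameters as the curl commands in this notebook; the helper name `build_search_url` is made up for illustration and is not part of any ENA client.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ebi.ac.uk/ena/portal/api/search'

def build_search_url(result, fields, query=None):
    """Return a Portal API search URL (hypothetical helper for illustration)."""
    params = {'result': result}
    if query is not None:
        params['query'] = query
    params['fields'] = ','.join(fields)   # the API expects a comma-separated list
    params['dataPortal'] = 'pathogen'
    params['dccDataOnly'] = 'true'
    # urlencode percent-encodes '=', '"' and ',' inside the values
    return BASE_URL + '?' + urlencode(params)

meta_url = build_search_url(
    'analysis',
    ['sample_alias', 'sample_accession', 'scientific_name',
     'submitted_md5', 'submitted_ftp', 'country'],
    query='analysis_type="AMR_ANTIBIOGRAM"',
)
print(meta_url)
```

Building the URL from a dictionary keeps the readable form (`analysis_type="AMR_ANTIBIOGRAM"`) in the code and leaves the escaping to `urlencode`.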

In [3]:
!curl -s -X GET --header 'Accept: text/plain' -u $user:$pwd 'https://www.ebi.ac.uk/ena/portal/api/search?result=analysis&query=analysis_type%3D%22AMR_ANTIBIOGRAM%22&fields=sample_alias%2Csample_accession%2Cscientific_name%2Csubmitted_md5%2Csubmitted_ftp%2Ccountry&dataPortal=pathogen&dccDataOnly=true' > metaFILE.txt
!curl -s -X GET --header 'Accept: text/plain' -u $user:$pwd 'https://www.ebi.ac.uk/ena/portal/api/search?result=read_run&fields=sample_alias%2Clibrary_layout%2Csample_accession%2Cfastq_md5%2Cfastq_ftp&dataPortal=pathogen&dccDataOnly=true' > readFILE.txt

### What do we have?
---
The **result=analysis** output lists the linked MIC files, their md5 hashes, and associated metadata.

In [4]:
metaDF = pd.read_csv('metaFILE.txt', sep='\t')
metaDF.head(3)

Unnamed: 0,analysis_accession,sample_alias,sample_accession,scientific_name,submitted_md5,submitted_ftp,country
0,ERZ373537,UTI_19-II,SAMEA4556054,Escherichia coli,7f4572c9a91ff3d40b873245c6c8badc;c62b6154a043a...,ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/AMC_UT...,Netherlands
1,ERZ373538,UTI_23,SAMEA4556055,Escherichia coli,98dd5420968c1505c057ea6e77ecbe20;cc6a8421d9e68...,ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373538/AMC_UT...,Netherlands
2,ERZ373539,UTI_28,SAMEA4556056,Escherichia coli,51c236775fd5e296685b516a31bfb77e;fb242de563407...,ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373539/AMC_UT...,Netherlands


Let's look at the first entry in detail.

Two files were submitted: a .txt and a .md5. The .txt contains the actual MIC values, while the .md5 contains the md5 hash of the MIC file. There is also a column called submitted_md5, which holds md5 hash values for both uploaded files, so the hashes are somewhat redundant.

In [5]:
metaDF.submitted_ftp.tolist()[0]

'ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/AMC_UTI_19_SAMEA4556054.txt;ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/ERZ373537.md5'
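
The `submitted_ftp` value packs several paths into one string separated by `;`. A small sketch of splitting such a field, using the value shown above; the helper name `split_by_extension` is illustrative, and it assumes each file extension occurs only once per entry.

```python
import posixpath

def split_by_extension(field):
    """Map file extension -> path for a ';'-separated list of paths."""
    paths = [p for p in field.split(';') if p]
    return {posixpath.splitext(p)[1]: p for p in paths}

submitted_ftp = ('ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/AMC_UTI_19_SAMEA4556054.txt;'
                 'ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/ERZ373537.md5')

files = split_by_extension(submitted_ftp)
print(files['.txt'])  # path of the MIC table
print(files['.md5'])  # path of the hash file
```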

Let's download them via curl. Some files require password authentication; others do not.

In [6]:
!curl -s -u $user:$pwd ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/ERZ373537.md5 > hashFile
!curl -s -u $user:$pwd ftp.sra.ebi.ac.uk/vol1/ERZ373/ERZ373537/AMC_UTI_19_SAMEA4556054.txt > MICfile.txt

In [7]:
!cat hashFile

7f4572c9a91ff3d40b873245c6c8badc  AMC_UTI_19_SAMEA4556054.txt


In [8]:
!md5sum MICfile.txt

7f4572c9a91ff3d40b873245c6c8badc  MICfile.txt


The calculated md5 hash matched the submitted one, so the downloaded file is not corrupted.
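
The same check can be done without leaving Python. This is a minimal sketch with `hashlib`, assuming the hash file uses the `md5sum` layout shown above (hash, whitespace, file name); the function names are illustrative.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def expected_md5(hash_path):
    """Read the hash from the first line of an md5sum-style file."""
    with open(hash_path) as handle:
        return handle.readline().split()[0].lower()

def is_intact(data_path, hash_path):
    """True if the file's md5 matches the one recorded in the hash file."""
    return md5_of_file(data_path) == expected_md5(hash_path)
```

With the files downloaded in this notebook, the call would be `is_intact('MICfile.txt', 'hashFile')`. Reading in chunks keeps memory use flat, which matters more for the large raw read files than for the small MIC tables.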

---
Now let's have a look at the MIC file. It contains the antibiotic_name, the measurement, its units and some other lab protocol identifiers.

In [9]:
micDF = pd.read_csv('MICfile.txt', sep='\t')
micDF.head(2)

Unnamed: 0,bioSample_ID,species,antibiotic_name,ast_standard,breakpoint_version,laboratory_typing_method,measurement,measurement_units,measurement_sign,resistance_phenotype,platform
0,SAMEA4556054,Escherichia coli,ciprofloxacin,EUCAST,2011,Microbroth dilution,1.0,mg/L,>=,intermediate,Vitek
1,SAMEA4556054,Escherichia coli,norfloxacin,EUCAST,2011,Microbroth dilution,2.0,mg/L,=,resistant,Vitek
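
As a rough illustration of the layout, the two rows shown above can also be parsed with the standard `csv` module. This sketch assumes the file is tab-separated with the header names seen in the output; the inline text is a stand-in for `MICfile.txt` built from the displayed values and keeps only a subset of the columns.

```python
import csv
import io

# Stand-in for MICfile.txt: a subset of columns, values taken from the output above
mic_text = (
    "bioSample_ID\tspecies\tantibiotic_name\tmeasurement\tmeasurement_units\t"
    "measurement_sign\tresistance_phenotype\n"
    "SAMEA4556054\tEscherichia coli\tciprofloxacin\t1.0\tmg/L\t>=\tintermediate\n"
    "SAMEA4556054\tEscherichia coli\tnorfloxacin\t2.0\tmg/L\t=\tresistant\n"
)

def summarize_mic(handle):
    """Return {antibiotic: (sign, value, units, phenotype)} from a tab-separated MIC table."""
    summary = {}
    for row in csv.DictReader(handle, delimiter='\t'):
        summary[row['antibiotic_name']] = (
            row['measurement_sign'],
            float(row['measurement']),
            row['measurement_units'],
            row['resistance_phenotype'],
        )
    return summary

summary = summarize_mic(io.StringIO(mic_text))
for drug, (sign, value, units, phenotype) in summary.items():
    print(f"{drug}: {sign} {value} {units} ({phenotype})")
```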


The other file, obtained with the **result=read_run** setting, contains the FTP links for the raw reads. It supports the same kind of md5 hash validation as we saw before.

In [10]:
readDF = pd.read_csv('readFILE.txt', sep='\t')
readDF.head(2)

Unnamed: 0,run_accession,sample_alias,library_layout,sample_accession,fastq_md5,fastq_ftp
0,DRR148121,SAMD00126358,PAIRED,SAMD00126358,b999b9f5eda6f3847ec9c68b52bce544;bd17a9d2a39eb...,ftp.sra.ebi.ac.uk/vol1/fastq/DRR148/DRR148121/...
1,DRR148122,SAMD00126359,PAIRED,SAMD00126359,04a92a50a8b629e434abc352665fa012;396d011a0706b...,ftp.sra.ebi.ac.uk/vol1/fastq/DRR148/DRR148122/...


### We can merge the metadata with the raw reads on the sample_accession column.

In [11]:
mergedDF = pd.merge(readDF, metaDF, on='sample_accession', how='inner')
mergedDF.head(3)

Unnamed: 0,run_accession,sample_alias_x,library_layout,sample_accession,fastq_md5,fastq_ftp,analysis_accession,sample_alias_y,scientific_name,submitted_md5,submitted_ftp,country
0,ERR1417711,sam_103239_20160518_DTU2016_546_PRJ1055_EScher...,PAIRED,SAMEA3993565,97fa323b1425ecf573d5d9067648fd4b;139229d14f074...,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/001/ERR141...,ERZ390162,sam_103239_20160518_DTU2016_546_PRJ1055_EScher...,Escherichia coli,6c263b3243049bbbb5081cea7d29c755;b84641ebe5076...,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390162/ERZ390...,Denmark
1,ERR1417712,sam_103239_20160518_DTU2016_547_PRJ1055_EScher...,PAIRED,SAMEA3993566,b21d5ecb564426806a459b5e7b92ec3a;33ca1f505080e...,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/002/ERR141...,ERZ390163,sam_103239_20160518_DTU2016_547_PRJ1055_EScher...,Escherichia coli,56bfc7bef13d72606ac727858c50909a;32b6e7c1b55cb...,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390163/ERZ390...,Denmark
2,ERR1417713,sam_103239_20160518_DTU2016_548_PRJ1055_EScher...,PAIRED,SAMEA3993567,7754fc8c0f741d05e795373960788657;da2456954bc64...,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/003/ERR141...,ERZ390164,sam_103239_20160518_DTU2016_548_PRJ1055_EScher...,Escherichia coli,3369909c9a0242e8aeb0a8b4ac943e07;403f07f4ec229...,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390164/ERZ390...,Denmark
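
Both tables carry a `sample_alias` column that is not part of the join key, so pandas renames them to `sample_alias_x` and `sample_alias_y`, as seen above. The toy sketch below repeats the inner join on made-up accessions (`S1`, `R1`, `Z1` and so on are placeholders, not real ENA records) to show that only samples present in both tables survive.

```python
import pandas as pd

# Toy stand-ins for readDF and metaDF; all identifiers are invented
reads = pd.DataFrame({
    'run_accession': ['R1', 'R2', 'R3'],
    'sample_alias': ['a1', 'a2', 'a3'],
    'sample_accession': ['S1', 'S2', 'S3'],
})
meta = pd.DataFrame({
    'analysis_accession': ['Z1', 'Z2'],
    'sample_alias': ['a1', 'a2'],
    'sample_accession': ['S1', 'S2'],
})

# Inner join: S3 has no metadata row, so its run is dropped
merged = pd.merge(reads, meta, on='sample_accession', how='inner')
print(merged.columns.tolist())
print(len(merged))
```

Passing `suffixes=('_read', '_meta')` to `pd.merge` would give the duplicated columns more descriptive names than the default `_x` and `_y`.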


Now we have the list of files for both MIC values and raw reads. With a for loop it is quite easy to download them and work on the local copies.

In [12]:
mergedDF[['fastq_ftp', 'sample_accession', 'submitted_ftp', 'scientific_name']].head(3)

Unnamed: 0,fastq_ftp,sample_accession,submitted_ftp,scientific_name
0,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/001/ERR141...,SAMEA3993565,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390162/ERZ390...,Escherichia coli
1,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/002/ERR141...,SAMEA3993566,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390163/ERZ390...,Escherichia coli
2,ftp.sra.ebi.ac.uk/vol1/fastq/ERR141/003/ERR141...,SAMEA3993567,ftp.sra.ebi.ac.uk/vol1/ERZ390/ERZ390164/ERZ390...,Escherichia coli
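
The download loop mentioned above might look like the following sketch. It assumes each row holds `;`-separated paths as strings in `fastq_ftp` and `submitted_ftp`, as in the tables above, and that the files are reachable without a password. The example paths, the `downloads` directory, and the helper names are hypothetical. Real rows could come from `mergedDF.to_dict('records')`.

```python
import os
import posixpath
from urllib.request import urlretrieve

def plan_downloads(rows, out_dir='downloads'):
    """Return (url, local_path) pairs, one per file listed in each row."""
    plan = []
    for row in rows:
        sample_dir = os.path.join(out_dir, row['sample_accession'])
        for column in ('fastq_ftp', 'submitted_ftp'):
            for path in row[column].split(';'):
                if path:  # skip empty entries
                    local = os.path.join(sample_dir, posixpath.basename(path))
                    plan.append(('ftp://' + path, local))
    return plan

def download_all(plan):
    """Fetch every planned file (needs network access; not called here)."""
    for url, local in plan:
        os.makedirs(os.path.dirname(local), exist_ok=True)
        urlretrieve(url, local)

# Invented example row; real values come from the merged table
rows = [{
    'sample_accession': 'SAMPLE0001',
    'fastq_ftp': ('ftp.example.org/fastq/RUN0001_1.fastq.gz;'
                  'ftp.example.org/fastq/RUN0001_2.fastq.gz'),
    'submitted_ftp': 'ftp.example.org/analysis/SAMPLE0001.txt',
}]

for url, local in plan_downloads(rows):
    print(url, '->', local)
```

Separating the planning step from the actual transfer makes it possible to inspect the file list before starting a long download, and to verify each local copy against its md5 hash afterwards.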
