<h1>Getting data files from NCBI SRA with eUtils</h1>
<p>First, ensure that Biopython is installed. The notebook also imports the <tt>xmltodict</tt>, <tt>pandas</tt> and <tt>numpy</tt> packages.</p>
<pre> sudo /home/.../anaconda3/bin/pip install biopython</pre>

In [119]:
from Bio import Entrez 
import xmltodict
import pandas as pd
import numpy as np

In [120]:
# to avoid anonymous requests; adjust as needed
Entrez.email = "my@email.eu"

In [137]:
database="sra"
search_terms="estrogen receptor chip-seq AND homo sapiens[Organism]"
maxhits=200
handle=Entrez.esearch(db=database,term=search_terms,retmax=maxhits)

In [138]:
records = Entrez.read(handle)

In [139]:
identifiers = records['IdList']
identifiers

['5772180', '5772179', '5772178', '5772177', '5772176', '5679946', '5679945', '5679944', '5679943', '5679942', '5679941', '5679940', '5679939', '5679938', '5679937', '5026291', '5026290', '5026289', '5026288', '5026287', '5026286', '5026285', '5026284', '5026283', '5026282', '5026281', '4463046', '4463045', '4463044', '4463043', '4463042', '4463041', '4463040', '4463039', '4463038', '4463037', '4463036', '4463035', '4463034', '4463033', '4463032', '4463031', '4463030', '4463029', '4417855', '4417854', '4417853', '4417852', '4417851', '4417850', '4417849', '4417848', '4417847', '4417846', '4417845', '4417844', '4417843', '4417842', '4417841', '4417840', '4417839', '4417838', '4417837', '4417836', '4417835', '4417834', '4417833', '4417832', '4285995', '4285994', '4285993', '4285992', '4285991', '4285990', '4285989', '4285988', '4285987', '4285986', '4175998', '4175997', '4175996', '4175995', '4175994', '4175993', '4175992', '4175991', '4175990', '4175989', '4175988', '4175987', '4148037'

In [140]:
handle = Entrez.efetch(db=database, id=identifiers, retmax="200", retmode="text")

In [141]:
# Read the raw XML response as text. It is parsed with xmltodict
# rather than with Entrez.read() or Entrez.parse().
text=handle.read()


In [143]:
doc=xmltodict.parse(text)

<h2>Parsing the output of SRA/eUtils</h2>
<p>The <tt>Entrez.efetch</tt> command returns XML, which the <tt>xmltodict</tt> package transforms into a hierarchical 
Python dictionary. The top-level element is <tt>EXPERIMENT_PACKAGE_SET</tt>, which contains a list of 
<tt>EXPERIMENT_PACKAGE</tt> objects, one for each experiment that SRA returns.</p>
<p>The following function extracts information from an individual <tt>EXPERIMENT_PACKAGE</tt> and writes it
    into one row of a pandas dataframe.</p>
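<p>One caveat, shown here as a minimal sketch rather than as part of the original notebook: <tt>xmltodict</tt> only produces a list when an element is repeated, so a result set with a single <tt>EXPERIMENT_PACKAGE</tt> yields a plain dictionary instead of a list. The helper names <tt>as_list</tt> and <tt>experiment_packages</tt> are hypothetical, and the sample dictionaries are hand-written stand-ins for parsed output, not real SRA records.</p>

```python
def as_list(value):
    """Wrap value in a list unless it already is one; None becomes an empty list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def experiment_packages(doc):
    """Return the EXPERIMENT_PACKAGE entries of a parsed response as a list."""
    package_set = doc.get("EXPERIMENT_PACKAGE_SET") or {}
    return as_list(package_set.get("EXPERIMENT_PACKAGE"))


# Hand-written stand-ins for xmltodict output (assumed shapes, not real data).
one_hit = {
    "EXPERIMENT_PACKAGE_SET": {
        "EXPERIMENT_PACKAGE": {"EXPERIMENT": {"@accession": "X1"}}
    }
}
two_hits = {
    "EXPERIMENT_PACKAGE_SET": {
        "EXPERIMENT_PACKAGE": [
            {"EXPERIMENT": {"@accession": "X1"}},
            {"EXPERIMENT": {"@accession": "X2"}},
        ]
    }
}

for package in experiment_packages(one_hit) + experiment_packages(two_hits):
    print(package["EXPERIMENT"]["@accession"])
```

<p>Alternatively, <tt>xmltodict.parse</tt> accepts a <tt>force_list</tt> argument naming the elements that should always be parsed as lists.</p>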

In [144]:

def parse_experiment_package(ep, i, df):
    """Extract selected fields from one EXPERIMENT_PACKAGE into row i of df.

    Raises KeyError if an expected element is missing, for example when
    the platform is not Illumina.
    """
    if 'EXPERIMENT' not in ep:
        return
    experiment = ep['EXPERIMENT']
    # XML attributes are exposed by xmltodict under keys prefixed with '@'
    experiment_alias = experiment['@alias']
    experiment_accession = experiment['@accession']
    title = experiment['TITLE']
    studyref = experiment['STUDY_REF']  # a nested dict
    study_accession = studyref['@accession']
    study_refname = studyref['@refname']
    design = experiment['DESIGN']  # a nested dict
    sample_descriptor = design['SAMPLE_DESCRIPTOR']
    sample_accession = sample_descriptor['@accession']
    library_descriptor = design['LIBRARY_DESCRIPTOR']  # a nested dict
    library_strategy = library_descriptor['LIBRARY_STRATEGY']
    library_construction = library_descriptor['LIBRARY_CONSTRUCTION_PROTOCOL']
    platform = experiment['PLATFORM']
    illumina = platform['ILLUMINA']  # only present for Illumina runs
    instrument = illumina['INSTRUMENT_MODEL']
    submission = ep['SUBMISSION']
    submission_alias = submission['@alias']
    submission_accession = submission['@accession']
    study = ep['STUDY']
    descriptor = study['DESCRIPTOR']
    study_title = descriptor['STUDY_TITLE']
    study_type = descriptor['STUDY_TYPE']  # a dict of attributes, e.g. '@existing_study_type'
    study_abstract = descriptor['STUDY_ABSTRACT']
    # the order of the values must match the order of the dataframe columns
    df.loc[i] = [experiment_alias, experiment_accession, title,
                 study_accession, study_refname, sample_accession,
                 library_strategy, library_construction, instrument,
                 submission_alias, submission_accession,
                 study_title, study_type, study_abstract]
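
<p>The function above only reads the instrument model from an <tt>ILLUMINA</tt> element, so records from other platforms raise a <tt>KeyError</tt>. As a sketch, and assuming (as the SRA XML schema appears to specify) that <tt>PLATFORM</tt> holds a single child element named after the platform, each with its own <tt>INSTRUMENT_MODEL</tt>, a platform-independent lookup could look like this. The helper name <tt>instrument_model</tt> and the sample inputs are hypothetical.</p>

```python
def instrument_model(platform):
    """Return (platform_name, instrument_model) for a parsed PLATFORM element.

    Assumes PLATFORM has one child keyed by the platform name; returns
    (None, None) when the element is missing or empty.
    """
    if not isinstance(platform, dict) or not platform:
        return (None, None)
    # Take the first (and, by assumption, only) child element.
    name, details = next(iter(platform.items()))
    model = details.get("INSTRUMENT_MODEL") if isinstance(details, dict) else None
    return (name, model)


# Hand-written examples of parsed PLATFORM elements (assumed shapes).
print(instrument_model({"ILLUMINA": {"INSTRUMENT_MODEL": "Illumina HiSeq 2500"}}))
print(instrument_model({"ION_TORRENT": {"INSTRUMENT_MODEL": "Ion Torrent PGM"}}))
print(instrument_model(None))
```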

<h2>Create a pandas dataframe to hold the results</h2>
<p>
We will now create a pandas dataframe to hold the results. Its columns must match, in order, the values that the function above writes into each row.
Duplicating the column list in two places is not good style and should be refactored later on.</p>
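
<p>One possible refactoring, given as a sketch only: keep the column names in a single tuple and have the parser build each row as a dictionary keyed by those names, so the order is defined in exactly one place. The names <tt>COLUMNS</tt> and <tt>make_row</tt> are hypothetical and not part of the original notebook.</p>

```python
# Single source of truth for the column names used by the parser and the dataframe.
COLUMNS = (
    "experiment_alias", "experiment_accession", "title",
    "study_accession", "study_refname", "sample_accession",
    "library_strategy", "library_construction", "instrument",
    "submission_alias", "submission_accession",
    "study_title", "study_type", "study_abstract",
)


def make_row(**fields):
    """Build one row as a dict ordered like COLUMNS; missing fields become None."""
    unknown = set(fields) - set(COLUMNS)
    if unknown:
        raise KeyError(f"unknown column(s): {sorted(unknown)}")
    return {name: fields.get(name) for name in COLUMNS}


row = make_row(experiment_accession="SRX4284795", instrument="Illumina HiSeq 2500")
print(row["experiment_accession"], row["title"])
```

<p>A list of such rows can then be passed to <tt>pd.DataFrame(rows, columns=COLUMNS)</tt>, which also avoids preallocating empty rows for entries that fail to parse.</p>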

In [145]:
numberOfRows=maxhits
df=pd.DataFrame(index=np.arange(0,numberOfRows),columns=('experiment_alias','experiment_accession','title',
               'study_accession','study_refname','sample_accession','library_strategy','library_construction',
               'instrument','submission_alias','submission_accession','study_title','study_type','study_abstract'))

In [147]:

doc2=doc['EXPERIMENT_PACKAGE_SET']
# the following gets a list of EXPERIMENT_PACKAGE objects
doc3=doc2['EXPERIMENT_PACKAGE']
i=0
for ep in doc3:
    try:
        parse_experiment_package(ep,i,df)
        i=i+1
    except (KeyError, TypeError):
        # an expected element is missing from this entry, so skip it
        print("Could not parse this entry")

Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
Could not parse this entry
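
<p>The messages above do not say why an entry failed. A minimal sketch of how the reasons could be tallied, assuming the failures are <tt>KeyError</tt> or <tt>TypeError</tt> exceptions raised by missing elements; the helper <tt>try_parse</tt>, the toy parser and the sample entries are hypothetical.</p>

```python
from collections import Counter


def try_parse(parser, entries):
    """Apply parser to each entry; return (results, Counter of failure reasons)."""
    parsed, failures = [], Counter()
    for entry in entries:
        try:
            parsed.append(parser(entry))
        except (KeyError, TypeError) as err:
            # Group failures by exception type and message, e.g. the missing key.
            failures[f"{type(err).__name__}: {err}"] += 1
    return parsed, failures


def toy_parser(entry):
    """Stand-in parser that, like the notebook's function, expects ILLUMINA."""
    return entry["PLATFORM"]["ILLUMINA"]["INSTRUMENT_MODEL"]


# Hand-written entries (assumed shapes, not real SRA records).
entries = [
    {"PLATFORM": {"ILLUMINA": {"INSTRUMENT_MODEL": "NextSeq 500"}}},
    {"PLATFORM": {"ION_TORRENT": {"INSTRUMENT_MODEL": "Ion Torrent PGM"}}},
    {},
]
parsed, failures = try_parse(toy_parser, entries)
for reason, count in failures.most_common():
    print(count, reason)
```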


In [148]:
from IPython.display import display, HTML
display(df)

Unnamed: 0,experiment_alias,experiment_accession,title,study_accession,study_refname,sample_accession,library_strategy,library_construction,instrument,submission_alias,submission_accession,study_title,study_type,study_abstract
0,GSM3212403,SRX4284795,GSM3212403: T47D_YFP_noE2_HA_ChIPseq; Homo sap...,SRP151140,GSE116170,SRS3448952,ChIP-Seq,Cells were grown hormone-deprived media for 7 ...,Illumina HiSeq 2500,GEO: GSE116170,SRA726535,Transcriptional properties of estrogen recepto...,{'@existing_study_type': 'Other'},RNA sequencing (RNA-seq) detects estrogen rece...
1,GSM3212402,SRX4284794,GSM3212402: T47D_ESR1NOP2_noE2_HA_ChIPseq; Hom...,SRP151140,GSE116170,SRS3448951,ChIP-Seq,Cells were grown hormone-deprived media for 7 ...,Illumina HiSeq 2500,GEO: GSE116170,SRA726535,Transcriptional properties of estrogen recepto...,{'@existing_study_type': 'Other'},RNA sequencing (RNA-seq) detects estrogen rece...
2,GSM3212401,SRX4284793,GSM3212401: T47D_ESR1PCDH11X_noE2_HA_ChIPseq; ...,SRP151140,GSE116170,SRS3448950,ChIP-Seq,Cells were grown hormone-deprived media for 7 ...,Illumina HiSeq 2500,GEO: GSE116170,SRA726535,Transcriptional properties of estrogen recepto...,{'@existing_study_type': 'Other'},RNA sequencing (RNA-seq) detects estrogen rece...
3,GSM3212400,SRX4284792,GSM3212400: T47D_ESR1YAP1_noE2_HA_ChIPseq; Hom...,SRP151140,GSE116170,SRS3448949,ChIP-Seq,Cells were grown hormone-deprived media for 7 ...,Illumina HiSeq 2500,GEO: GSE116170,SRA726535,Transcriptional properties of estrogen recepto...,{'@existing_study_type': 'Other'},RNA sequencing (RNA-seq) detects estrogen rece...
4,GSM3212399,SRX4284791,GSM3212399: T47D_ESR1WT_noE2_HA_ChIPseq; Homo ...,SRP151140,GSE116170,SRS3448967,ChIP-Seq,Cells were grown hormone-deprived media for 7 ...,Illumina HiSeq 2500,GEO: GSE116170,SRA726535,Transcriptional properties of estrogen recepto...,{'@existing_study_type': 'Other'},RNA sequencing (RNA-seq) detects estrogen rece...
5,GSM3184880,SRX4193150,GSM3184880: ERalpha s10 E2 ChIPSeq; Homo sapie...,SRP150236,GSE115607,SRS3403984,ChIP-Seq,Cells were formaldehyde fixed for 15 minutes (...,NextSeq 500,GEO: GSE115607,SRA719221,ChIP-Seq analysis of estrogen deprived MCF7 ce...,{'@existing_study_type': 'Other'},The goal of this experiment was to interrogate...
6,GSM3184879,SRX4193149,GSM3184879: ERalpha s9 Fulv ChIPSeq; Homo sapi...,SRP150236,GSE115607,SRS3403986,ChIP-Seq,Cells were formaldehyde fixed for 15 minutes (...,NextSeq 500,GEO: GSE115607,SRA719221,ChIP-Seq analysis of estrogen deprived MCF7 ce...,{'@existing_study_type': 'Other'},The goal of this experiment was to interrogate...
7,GSM3184878,SRX4193148,GSM3184878: ERalpha s8 Tam ChIPSeq; Homo sapie...,SRP150236,GSE115607,SRS3403983,ChIP-Seq,Cells were formaldehyde fixed for 15 minutes (...,NextSeq 500,GEO: GSE115607,SRA719221,ChIP-Seq analysis of estrogen deprived MCF7 ce...,{'@existing_study_type': 'Other'},The goal of this experiment was to interrogate...
8,GSM3184877,SRX4193147,GSM3184877: ERalpha s7 s5942 ChIPSeq; Homo sap...,SRP150236,GSE115607,SRS3403982,ChIP-Seq,Cells were formaldehyde fixed for 15 minutes (...,NextSeq 500,GEO: GSE115607,SRA719221,ChIP-Seq analysis of estrogen deprived MCF7 ce...,{'@existing_study_type': 'Other'},The goal of this experiment was to interrogate...
9,GSM3184876,SRX4193146,GSM3184876: ERalpha s6 5942 ChIPSeq; Homo sapi...,SRP150236,GSE115607,SRS3403979,ChIP-Seq,Cells were formaldehyde fixed for 15 minutes (...,NextSeq 500,GEO: GSE115607,SRA719221,ChIP-Seq analysis of estrogen deprived MCF7 ce...,{'@existing_study_type': 'Other'},The goal of this experiment was to interrogate...
