# Download SARS-CoV-2 genomic data


**2020-11-29 - Pablo Riesgo Ferreiro**

* [NCBI Entrez Biopython setup](#entrez-setup)
* [Download SARS-CoV-2 protein assemblies](#protein-assemblies)
* [Download SARS-CoV-2 DNA sequences](#dna-sequences)


The aim of this prototype is to download SARS-CoV-2 genomic data and protein assemblies from public databases. Protein data has previously been downloaded from the NCBI Virus database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/), and only a subset of the data in NCBI Virus is available in the Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra). All datasets in NCBI Virus are also available through NCBI Entrez (https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html): at the time of writing, Entrez contains 456,634 protein sequences for SARS-CoV-2, while NCBI Virus contains 456,535. The genomic data in the SRA is available through Entrez as well; at the time of writing there are 144,812 entries.


NCBI Entrez makes its data available through a REST API; see https://www.ncbi.nlm.nih.gov/books/NBK25500/ for further details.

As an example, this query searches for SARS-CoV-2 protein sequences and returns their identifiers: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=sars-cov-2[organism]
```
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>456634</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>1937228382</Id>
<Id>1937228381</Id>
<Id>1937228380</Id>
<Id>1937228379</Id>
<Id>1937228378</Id>
<Id>1937228377</Id>
<Id>1937228376</Id>
<Id>1937228375</Id>
<Id>1937228374</Id>
<Id>1937228373</Id>
<Id>1937228372</Id>
<Id>1937228371</Id>
<Id>1937227139</Id>
<Id>1937227138</Id>
<Id>1937227137</Id>
<Id>1937227136</Id>
<Id>1937227135</Id>
<Id>1937227134</Id>
<Id>1937227133</Id>
<Id>1937227132</Id>
</IdList><TranslationSet><Translation>     <From>sars-cov-2[organism]</From>     <To>"Severe acute respiratory syndrome coronavirus 2"[Organism]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"Severe acute respiratory syndrome coronavirus 2"[Organism]</Term>    <Field>Organism</Field>    <Count>456634</Count>    <Explode>Y</Explode>   </TermSet>   <OP>GROUP</OP>  </TranslationStack><QueryTranslation>"Severe acute respiratory syndrome coronavirus 2"[Organism]</QueryTranslation></eSearchResult>
```

The above query returns the first 20 results, but the API allows paginating through the whole dataset using the parameters `retmax` and `retstart`.
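As a rough sketch of how such a response could be consumed without any third-party package, the snippet below parses a shortened copy of the XML above with the standard library and computes the `retstart` offsets needed to visit every result. The helper names `parse_esearch` and `page_offsets` are hypothetical, and the sample is trimmed to three identifiers.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the eSearchResult document shown above (three ids only).
SAMPLE = """<eSearchResult><Count>456634</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>1937228382</Id>
<Id>1937228381</Id>
<Id>1937228380</Id>
</IdList></eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total_count, ids) from an eSearchResult XML document."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return total, ids


def page_offsets(total, page_size):
    """Offsets to pass as retstart so that every result is visited once."""
    return range(0, total, page_size)


total, ids = parse_esearch(SAMPLE)
offsets = page_offsets(total, page_size=20)
print(total, ids, len(offsets))
```

With a page size of 20, covering all 456,634 entries would take 22,832 requests, which is a good reason to pick a larger page size for bulk downloads.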

A query such as the following fetches a protein sequence in FASTA format: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=1937228382&rettype=fasta&retmode=text
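The two URLs shown so far follow the same pattern, so they can be assembled programmatically. The following is a minimal sketch using only the standard library; the function names `esearch_url` and `efetch_url` are made up for this illustration, while the endpoints and parameter names are the ones appearing in the URLs above.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20, retstart=0):
    """URL of a search returning one page of identifiers."""
    params = {"db": db, "term": term, "retmax": retmax, "retstart": retstart}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, identifier, rettype="fasta", retmode="text"):
    """URL fetching one record in the requested format."""
    params = {"db": db, "id": identifier, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("protein", "sars-cov-2[organism]", retmax=100, retstart=200))
print(efetch_url("protein", "1937228382"))
```

Note that `urlencode` percent-encodes the square brackets of the search term, which browsers do implicitly when the URL is pasted into the address bar.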

Several Python packages wrap the Entrez API and make the data readily available. The most popular seem to be:

- Biopython includes support for Entrez https://biopython.org/docs/1.75/api/Bio.Entrez.html
- Entrezpy is a dedicated package for this purpose https://academic.oup.com/bioinformatics/article/35/21/4511/5488119

Here we show how to use Biopython to fetch the SARS-CoV-2 data from NCBI Entrez.

## NCBI Entrez Biopython setup <a class="anchor" id="entrez-setup"></a>

In [1]:
from Bio import Entrez
import json

To make use of NCBI's E-utilities, NCBI requires you to specify your
email address with each request.  As an example, if your email address
is A.N.Other@example.com, you can specify it as follows:
   from Bio import Entrez
   Entrez.email = 'A.N.Other@example.com'
In case of excessive usage of the E-utilities, NCBI will attempt to contact
a user at the email address provided before blocking access to the
E-utilities.

In [2]:
Entrez.email = "pablo.riesgoferreiro@tron-mainz.de"

A production system that makes intensive use of the Entrez API can be configured with an API key, which has to be requested from NCBI.

> All the functions that send requests to the NCBI Entrez API will automatically respect the NCBI rate limit (of 3 requests per second without an API key, or 10 requests per second with an API key) and will automatically retry when encountering transient failures (i.e. connection failures or HTTP 5XX codes). (https://biopython.org/docs/1.75/api/Bio.Entrez.html#module-Bio.Entrez)
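When the REST API is called directly rather than through Biopython, the same identification can be sent as request parameters. The following is a minimal sketch, assuming the key is stored in an environment variable named `NCBI_API_KEY` (a name chosen here for illustration) and using a hypothetical helper `entrez_identity`; `email` and `api_key` are E-utilities request parameters, and the limits are the ones quoted above. In Biopython the key would instead be assigned to `Entrez.api_key`.

```python
import os


def entrez_identity(email, environ=os.environ):
    """Build the identification parameters and the matching request budget."""
    params = {"email": email}
    api_key = environ.get("NCBI_API_KEY")  # hypothetical variable name
    if api_key:
        params["api_key"] = api_key
    # Limits quoted above: 3 requests/s without a key, 10 requests/s with one.
    requests_per_second = 10 if api_key else 3
    return params, requests_per_second


params, rate = entrez_identity("A.N.Other@example.com", environ={})
print(params, rate)
```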

Entrez provides access to multiple NCBI databases, which are listed below. Each database supports different output formats, as described here: https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly.

In [3]:
entrez_info = Entrez.read(Entrez.einfo())
print(json.dumps(entrez_info, indent=3))

{
   "DbList": [
      "pubmed",
      "protein",
      "nuccore",
      "ipg",
      "nucleotide",
      "structure",
      "sparcle",
      "protfam",
      "genome",
      "annotinfo",
      "assembly",
      "bioproject",
      "biosample",
      "blastdbinfo",
      "books",
      "cdd",
      "clinvar",
      "gap",
      "gapplus",
      "grasp",
      "dbvar",
      "gene",
      "gds",
      "geoprofiles",
      "homologene",
      "medgen",
      "mesh",
      "ncbisearch",
      "nlmcatalog",
      "omim",
      "orgtrack",
      "pmc",
      "popset",
      "proteinclusters",
      "pcassay",
      "biosystems",
      "pccompound",
      "pcsubstance",
      "seqannot",
      "snp",
      "sra",
      "taxonomy",
      "biocollections",
      "gtr"
   ]
}


## Download SARS-CoV-2 protein assemblies <a class="anchor" id="protein-assemblies"></a>

Here we show how to download protein assemblies of different samples taken from COVID-19 patients. The API supports search criteria such as restricting the data to a protein of interest, e.g. the spike protein. The wild-type reference is also available through this API.

### Search for protein assemblies given certain criteria

Search for all SARS-CoV-2 protein entries for the spike protein, known here as `surface glycoprotein`. The search only returns the identifiers of the entries matching the search criteria. The API returns 20 entries by default; the helper below lowers this default to 5.

In [4]:
def search_spike_proteins(retmax=5, retstart=0):
    database = "protein"
    search_term = "(sars-cov-2[Organism]) AND surface glycoprotein[Protein Name]"
    handle = Entrez.esearch(db=database, retmax=retmax, retstart=retstart, term=search_term)
    search_results = Entrez.read(handle)
    handle.close()
    return search_results

In [5]:
protein_search_results = search_spike_proteins()
print(json.dumps(protein_search_results, indent=3))

{
   "Count": "38258",
   "RetMax": "5",
   "RetStart": "0",
   "IdList": [
      "1937378878",
      "1937376165",
      "1937371065",
      "1937228373",
      "1937227130"
   ],
   "TranslationSet": [
      {
         "From": "sars-cov-2[Organism]",
         "To": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism]"
      }
   ],
   "TranslationStack": [
      {
         "Term": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism]",
         "Field": "Organism",
         "Count": "456670",
         "Explode": "Y"
      },
      {
         "Term": "surface glycoprotein[Protein Name]",
         "Field": "Protein Name",
         "Count": "39100",
         "Explode": "N"
      },
      "AND"
   ],
   "QueryTranslation": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism] AND surface glycoprotein[Protein Name]"
}
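Search terms like the one above are plain strings of field-tagged expressions joined by boolean operators. As an illustration only, a small hypothetical helper (`build_term`, not part of Biopython) could assemble them from keyword arguments; the field names `Organism` and `Protein Name` are the ones used in the query above, and the extra parentheses around each term are an assumption that should not change the meaning of the query.

```python
def build_term(**fields):
    """Join field-tagged terms with AND; underscores in names become spaces."""
    parts = [
        f"({value}[{name.replace('_', ' ')}])"
        for name, value in fields.items()
    ]
    return " AND ".join(parts)


term = build_term(Organism="sars-cov-2", Protein_Name="surface glycoprotein")
print(term)
```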


### Pagination through the whole dataset

We can paginate through the whole dataset using the parameters `retmax` and `retstart`. In this example we just fetch 5 consecutive pages, each containing 2 entries. This approach can be extended to iterate through the whole dataset.

In [6]:
page_size = 2
identifiers = []
for i in range(5):
    # page i starts at offset page_size * i
    search_results = search_spike_proteins(retmax=page_size, retstart=page_size * i)
    identifiers.extend(search_results.get("IdList", []))
identifiers

['1937378878',
 '1937376165',
 '1937371065',
 '1937228373',
 '1937227130',
 '1937191502',
 '1937191487',
 '1937191473',
 '1937191459',
 '1937191444']
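To extend this loop to the whole dataset, the iteration has to stop once `Count` results have been visited. The generator below is one possible sketch; `iter_identifiers` is a hypothetical name, and it accepts any callable with the same signature and return shape as `search_spike_proteins`. A stand-in search function is used here so the example runs without network access.

```python
def iter_identifiers(search, page_size=100, limit=None):
    """Yield identifiers page by page until the result set (or limit) is exhausted.

    `search` is any callable accepting retmax and retstart and returning a
    dict with "Count" and "IdList", like search_spike_proteins above.
    """
    retstart = 0
    yielded = 0
    while True:
        page = search(retmax=page_size, retstart=retstart)
        ids = page.get("IdList", [])
        if not ids:
            return
        for identifier in ids:
            if limit is not None and yielded >= limit:
                return
            yield identifier
            yielded += 1
        retstart += page_size
        if retstart >= int(page["Count"]):
            return


# Stand-in for the real search function so the sketch runs offline.
def fake_search(retmax, retstart):
    all_ids = [str(n) for n in range(7)]
    return {"Count": str(len(all_ids)), "IdList": all_ids[retstart:retstart + retmax]}


print(list(iter_identifiers(fake_search, page_size=3)))
```

Passing `limit` keeps exploratory runs small, for example `iter_identifiers(search_spike_proteins, page_size=100, limit=500)`.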

### Fetch entry metadata

Fetch all metadata associated with a protein, given its identifier.

In [7]:
def get_protein_metadata(identifier):
    database = "protein"
    handle = Entrez.esummary(db=database, id=identifier)
    protein = Entrez.read(handle)
    handle.close()
    return protein

In [8]:
protein_metadata = get_protein_metadata(identifier=identifiers[0])
print(json.dumps(protein_metadata, indent=3))

[
   {
      "Item": [],
      "Id": "1937378878",
      "Caption": "QPI70337",
      "Title": "surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]",
      "Extra": "gi|1937378878|gb|QPI70337.1|[1937378878]",
      "Gi": 1937378878,
      "CreateDate": "2020/11/26",
      "UpdateDate": "2020/11/26",
      "Flags": 0,
      "TaxId": 2697049,
      "Length": 1273,
      "Status": "live",
      "ReplacedBy": "",
      "Comment": "  ",
      "AccessionVersion": "QPI70337.1"
   }
]


### Fetch protein sequence

Given one or more identifiers we can fetch the corresponding protein sequences in FASTA format. Here we fetch the first 4 entries.

In [9]:
def get_protein_fasta(identifiers):
    handle = Entrez.efetch(db="protein", id=identifiers, rettype="fasta", retmode="text")
    fasta = handle.read().strip()
    handle.close()
    return fasta

In [10]:
fasta = get_protein_fasta(identifiers[0:4])
print(fasta)

>QPI70337.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHV
SGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPF
LGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPI
NLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYN
ENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASV
YAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGNTPCNGVEGFNCYF
PLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL
PFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQGVNCTEVPVAIHADQLT
PTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLG
AENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGI
AVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYR
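The FASTA text returned for several identifiers is one string containing all records. Biopython's `SeqIO.parse` is the usual way to read it; purely as an illustration of the format, the sketch below splits such a string with the standard library. `parse_fasta` is a hypothetical helper, and the example text is a shortened first record plus an invented toy record.

```python
def parse_fasta(text):
    """Map each record identifier (first word of the header) to its sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            # Sequence lines are wrapped, so collect and join them later.
            records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>QPI70337.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTT
RTQLPPAYTN

>toy.2 invented second record
MKV
"""
sequences = parse_fasta(example)
print({name: len(seq) for name, seq in sequences.items()})
```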

Alternatively, we can fetch the whole entry in GenPept format (https://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/Format/sample_record_genpept.html), requested here as XML, which Biopython parses into nested Python dictionaries and lists that can be printed as JSON. The entry contains all metadata related to this protein assembly, such as the geographical location, as well as the amino acid sequence.

In [11]:
def get_protein_genpept(identifier):
    handle = Entrez.efetch(db="protein", id=identifier, rettype="gb", retmode="xml")
    entry = Entrez.read(handle)
    handle.close()
    return entry

In [12]:
genpept = get_protein_genpept(identifier=identifiers[0])
print(json.dumps(genpept, indent=3))

[
   {
      "GBSeq_locus": "QPI70337",
      "GBSeq_length": "1273",
      "GBSeq_moltype": "AA",
      "GBSeq_topology": "linear",
      "GBSeq_division": "VRL",
      "GBSeq_update-date": "26-NOV-2020",
      "GBSeq_create-date": "26-NOV-2020",
      "GBSeq_definition": "surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]",
      "GBSeq_primary-accession": "QPI70337",
      "GBSeq_accession-version": "QPI70337.1",
      "GBSeq_other-seqids": [
         "gb|QPI70337.1|",
         "gi|1937378878"
      ],
      "GBSeq_source": "Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)",
      "GBSeq_organism": "Severe acute respiratory syndrome coronavirus 2",
      "GBSeq_taxonomy": "Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus",
      "GBSeq_references": [
         {
            "GBReference_reference": "1",
            "GBReference_position": "1.

Fetch all entries for which we have identifiers and extract the amino acid sequence.

In [13]:
# One request per identifier; sequences are upper-cased to match the reference translation
protein_sequences = [get_protein_genpept(identifier=x)[0].get("GBSeq_sequence").upper() for x in identifiers]
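Issuing one request per identifier becomes slow for long identifier lists. Since `efetch` accepts several identifiers per call, as `get_protein_fasta` above already does, one option is to split the list into batches. This is a sketch under assumptions: `chunked` and `fetch_in_batches` are hypothetical helpers, the batch size of 200 is an arbitrary choice, and `fetch_batch` stands for any function that takes a list of identifiers and returns a list of results.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_in_batches(identifiers, fetch_batch, size=200):
    """Call fetch_batch once per batch and concatenate the returned lists."""
    results = []
    for batch in chunked(identifiers, size):
        results.extend(fetch_batch(batch))
    return results


# Offline stand-in: a real fetch_batch would call Entrez.efetch with the batch.
calls = []


def fake_fetch(batch):
    calls.append(len(batch))
    return [f"seq-{identifier}" for identifier in batch]


out = fetch_in_batches([str(n) for n in range(5)], fake_fetch, size=2)
print(calls, out)
```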

### Fetch the reference sequence

There are multiple assembly versions available for SARS-CoV-2; in particular, three share the most recent submission date:
```
accession:    GCA_011545335.2
name :        ASM1154533v2
submission:   2020/03/12 00:00
accession:    GCA_011545325.2
name :        ASM1154532v2
submission:   2020/03/12 00:00
accession:    GCA_011545285.2
name :        ASM1154528v2
submission:   2020/03/12 00:00
```

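From now on we will use the first entry, GCA_011545335.2, which is defined here: https://www.ncbi.nlm.nih.gov/nuccore/MT184907.2. The accession MT184907.2 cannot be fetched from the Entrez assembly database, presumably because it is a nucleotide record accession rather than an assembly accession; the reference is therefore fetched from the nucleotide database below.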

In [14]:
# This cell only inspects the available assemblies
handle = Entrez.esearch(db="assembly", term="(sars-cov-2[Organism])")
search_results = Entrez.read(handle)
handle.close()
reference_identifiers = search_results.get("IdList", [])

assemblies = []
for i in reference_identifiers:
    handle = Entrez.esummary(db="assembly", id=i, report="full")
    summary = Entrez.read(handle)['DocumentSummarySet']['DocumentSummary']
    handle.close()
    if len(summary) > 0:
        assembly_accession = summary[0]['AssemblyAccession']
        assembly_name = summary[0]['AssemblyName']
        submission_date = summary[0]['SubmissionDate']
        assemblies.append((assembly_accession, assembly_name, submission_date))
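The cell above collects the assemblies but does not display them. A possible way to list them with the most recent submission first is sketched below; the tuples are copied from the listing at the start of this section, `newest_first` is a hypothetical helper, and the date format is assumed from that listing.

```python
from datetime import datetime

# Tuples shaped like the `assemblies` list built above; values copied from the
# listing at the start of this section.
assemblies = [
    ("GCA_011545335.2", "ASM1154533v2", "2020/03/12 00:00"),
    ("GCA_011545325.2", "ASM1154532v2", "2020/03/12 00:00"),
    ("GCA_011545285.2", "ASM1154528v2", "2020/03/12 00:00"),
]


def newest_first(entries, date_format="%Y/%m/%d %H:%M"):
    """Sort (accession, name, submission_date) tuples, most recent first."""
    return sorted(
        entries,
        key=lambda entry: datetime.strptime(entry[2], date_format),
        reverse=True,
    )


for accession, name, submitted in newest_first(assemblies):
    print(f"accession:    {accession}\nname :        {name}\nsubmission:   {submitted}")
```

Because the sort is stable, entries sharing a submission date keep the order in which Entrez returned them.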

Fetch the whole reference genome with annotations.

In [15]:
# Look up the internal id of the reference given its accession, then fetch the full record
def get_sars_cov_2_reference():
    handle = Entrez.esearch(db="nuccore", term="MT184907.2")
    record_id = Entrez.read(handle)["IdList"][0]
    handle.close()
    handle = Entrez.efetch(db="nuccore", id=record_id, retmode='xml')
    reference = Entrez.read(handle)
    handle.close()
    return reference

In [16]:
reference = get_sars_cov_2_reference()

Fetch the reference for the spike protein. Note that the names do not match: the reference genome annotates the gene as `S`, whereas the protein database uses the name `surface glycoprotein`.

In [17]:
def get_protein_reference_by_name(reference, protein_name):
    # Keep the CDS features whose "gene" qualifier equals the requested name
    gene_qualifier = {'GBQualifier_name': 'gene', 'GBQualifier_value': protein_name}
    results = [
        feature for feature in reference[0].get("GBSeq_feature-table")
        if feature.get("GBFeature_key") == "CDS"
        and gene_qualifier in feature.get("GBFeature_quals", [])]
    assert len(results) == 1
    return results[0]

In [18]:
protein_name = "S"
protein_s_reference = get_protein_reference_by_name(reference, protein_name)
print(json.dumps(protein_s_reference, indent=3))

{
   "GBFeature_key": "CDS",
   "GBFeature_location": "21563..25384",
   "GBFeature_intervals": [
      {
         "GBInterval_from": "21563",
         "GBInterval_to": "25384",
         "GBInterval_accession": "MT184907.2"
      }
   ],
   "GBFeature_quals": [
      {
         "GBQualifier_name": "gene",
         "GBQualifier_value": "S"
      },
      {
         "GBQualifier_name": "codon_start",
         "GBQualifier_value": "1"
      },
      {
         "GBQualifier_name": "transl_table",
         "GBQualifier_value": "1"
      },
      {
         "GBQualifier_name": "product",
         "GBQualifier_value": "surface glycoprotein"
      },
      {
         "GBQualifier_name": "protein_id",
         "GBQualifier_value": "QIJ96463.1"
      },
      {
         "GBQualifier_name": "translation",
         "GBQualifier_value": "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSA

In [19]:
protein_s_sequence = list(filter(lambda x: x.get("GBQualifier_name") == "translation", protein_s_reference.get("GBFeature_quals")))[0].get("GBQualifier_value")
protein_s_sequence

'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITG
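As a quick sanity check, the `GBFeature_location` of the S gene can be compared with the protein length reported earlier (`"Length": 1273`). The sketch below handles only simple `start..end` locations, which is an assumption; compound locations would need a proper parser. `cds_length` is a hypothetical helper.

```python
def cds_length(location):
    """Parse a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    nucleotides = end - start + 1
    codons, remainder = divmod(nucleotides, 3)
    return nucleotides, codons, remainder


nucleotides, codons, remainder = cds_length("21563..25384")
print(nucleotides, codons, remainder)  # 3822 1274 0
```

The 3,822 nucleotides correspond to 1,274 codons, i.e. 1,273 amino acids plus the stop codon, which agrees with the length of the protein entries.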

### Perform pairwise alignment

In [20]:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment

for p in protein_sequences[0:3]:
    print("Protein alignments:")
    alignments = pairwise2.align.globalxx(protein_s_sequence, p)
    print(format_alignment(*alignments[0]))

Protein alignments:
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGS-TPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQD-VNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLN
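`globalxx` scores one point per identical character and applies no mismatch or gap penalties, which is why a substitution shows up in the alignment as a pair of gaps (for example `S-` in the reference against the sample). When two sequences have the same length, the substitutions can also be listed by direct comparison. This is a minimal sketch with a hypothetical helper `substitutions`; the inputs are short windows copied from the reference and the first sample above, so the positions are relative to each window and not to the full protein.

```python
def substitutions(reference, sample):
    """List differences as (1-based position, reference residue, sample residue)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length; align them first")
    return [
        (position, ref, alt)
        for position, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


# Short windows copied from the two sequences above, around the visible differences.
first = substitutions("IYQAGSTPCNGV", "IYQAGNTPCNGV")
second = substitutions("VLYQDVNCTEV", "VLYQGVNCTEV")
print(first, second)
```

Newer Biopython releases recommend `Bio.Align.PairwiseAligner` over `pairwise2`, so a longer-lived version of this analysis may want to switch.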

## Download SARS-CoV-2 DNA sequences <a class="anchor" id="dna-sequences"></a>

We can use Entrez to query the SRA. Unfortunately, the sequencing data itself has to be downloaded with a different tool. Entrez can still be used to track newly available datasets and to obtain the run accessions (xRR codes such as SRR, ERR or DRR) needed to download the data.

The default tool to download data from the SRA is the SRA Toolkit, which is available as a command line tool. The Python package pysradb (https://github.com/saketkc/pysradb) was explored, but it did not seem to work as expected.
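For orientation, a download with the SRA Toolkit typically combines `prefetch` and `fasterq-dump`. The sketch below only builds the command lines and does not execute them; `sra_download_commands` is a hypothetical helper, the output directory name is arbitrary, and `SRR0000001` is a placeholder rather than a real run of this dataset.

```python
import shlex


def sra_download_commands(run_accession, outdir="fastq"):
    """Return the two SRA Toolkit commands as argument lists (not executed)."""
    if run_accession[:3] not in ("SRR", "ERR", "DRR"):
        raise ValueError(f"not a run accession: {run_accession}")
    return [
        ["prefetch", run_accession],
        ["fasterq-dump", run_accession, "--outdir", outdir],
    ]


commands = sra_download_commands("SRR0000001")  # placeholder accession
for command in commands:
    print(shlex.join(command))
```

The argument lists could be passed to `subprocess.run` once the toolkit is installed and on the `PATH`.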

### Track SRA samples with Entrez

Below we show how to query the SRA for existing datasets that match given search criteria.

In [21]:
def search_sra(retmax=5, retstart=0):
    database = "sra"
    search_term = "(sars-cov-2[Organism])"
    handle = Entrez.esearch(db=database, term=search_term, retmax=retmax, retstart=retstart)
    search_results = Entrez.read(handle)
    handle.close()
    return search_results

In [22]:
search_results = search_sra(retmax=5, retstart=0)
print(json.dumps(search_results, indent=3))
sra_identifiers = search_results.get("IdList")

{
   "Count": "149147",
   "RetMax": "5",
   "RetStart": "0",
   "IdList": [
      "12547333",
      "12547332",
      "12547331",
      "12547330",
      "12547329"
   ],
   "TranslationSet": [
      {
         "From": "sars-cov-2[Organism]",
         "To": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism]"
      }
   ],
   "TranslationStack": [
      {
         "Term": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism]",
         "Field": "Organism",
         "Count": "149147",
         "Explode": "Y"
      },
      "GROUP"
   ],
   "QueryTranslation": "\"Severe acute respiratory syndrome coronavirus 2\"[Organism]"
}


In [23]:
def get_sequence_metadata(identifier):
    database = "sra"
    handle = Entrez.esummary(db=database, id=identifier, retmode="xml")
    entry = Entrez.read(handle)
    handle.close()
    return entry

In [24]:
sra_entry_metadata = get_sequence_metadata(sra_identifiers[0])

print(json.dumps(sra_entry_metadata, indent=3))

[
   {
      "Item": [],
      "Id": "12547333",
      "ExpXml": "<Summary><Title>SARS-CoV-2 patient specimen</Title><Platform instrument_model=\"Illumina MiSeq\">ILLUMINA</Platform><Statistics total_runs=\"1\" total_spots=\"35857\" total_bases=\"16858817\" total_size=\"6691344\" load_done=\"true\" cluster_name=\"public\"/></Summary><Submitter acc=\"SRA1164901\" center_name=\"Quest Diagnostics\" contact_name=\"Ron M Kagan\" lab_name=\"Infectious Diseases\"/><Experiment acc=\"SRX9603372\" ver=\"1\" status=\"public\" name=\"SARS-CoV-2 patient specimen\"/><Study acc=\"SRP267191\" name=\"Severe acute respiratory syndrome coronavirus 2 Genome sequencing\"/><Organism taxid=\"2697049\" ScientificName=\"Severe acute respiratory syndrome coronavirus 2\"/><Sample acc=\"SRS7807085\" name=\"\"/><Instrument ILLUMINA=\"Illumina MiSeq\"/><Library_descriptor><LIBRARY_NAME>FL-QDX-1775</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>VIRAL RNA</LIBRARY_SOURCE><LIBRARY_SELECTION>RT-P

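The `ExpXml` field in the summary above is an XML fragment stored as a string, with several top-level elements and no single root. One way to read it, sketched here with the standard library, is to wrap it in an artificial root element first. `parse_exp_xml` is a hypothetical helper and the input is a shortened copy of the value printed above.

```python
import xml.etree.ElementTree as ET


def parse_exp_xml(exp_xml):
    """Wrap the fragment in a single root element so it can be parsed as XML."""
    root = ET.fromstring(f"<root>{exp_xml}</root>")
    return {
        "title": root.findtext("./Summary/Title"),
        "platform": root.findtext("./Summary/Platform"),
        "experiment": root.find("./Experiment").get("acc"),
        "study": root.find("./Study").get("acc"),
        "sample": root.find("./Sample").get("acc"),
    }


# Shortened copy of the ExpXml value printed above.
exp_xml = (
    '<Summary><Title>SARS-CoV-2 patient specimen</Title>'
    '<Platform instrument_model="Illumina MiSeq">ILLUMINA</Platform></Summary>'
    '<Experiment acc="SRX9603372" ver="1" status="public" name="SARS-CoV-2 patient specimen"/>'
    '<Study acc="SRP267191" name="Severe acute respiratory syndrome coronavirus 2 Genome sequencing"/>'
    '<Sample acc="SRS7807085" name=""/>'
)
summary = parse_exp_xml(exp_xml)
print(summary)
```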
In [25]:
import xmltodict

def get_sequence(identifier):
    handle = Entrez.efetch(db="sra", id=identifier, rettype="full", retmode="xml")
    sra_sequence = handle.read()
    handle.close()
    return xmltodict.parse(sra_sequence)

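The dictionary returned by `get_sequence` is deeply nested, as the output of the next cell shows. A small accessor keeps downstream code readable. The sketch below uses a hand-written subset of that structure so it runs without network access; `experiment_accessions` is a hypothetical helper.

```python
def experiment_accessions(sra_entry):
    """Collect experiment accessions from an xmltodict-style nested dict."""
    packages = sra_entry["EXPERIMENT_PACKAGE_SET"]["EXPERIMENT_PACKAGE"]
    # xmltodict yields a dict for a single child element and a list for repeated ones.
    if isinstance(packages, dict):
        packages = [packages]
    return [package["EXPERIMENT"]["@accession"] for package in packages]


# Hand-written subset of the structure returned for one SRA entry.
example = {
    "EXPERIMENT_PACKAGE_SET": {
        "EXPERIMENT_PACKAGE": {
            "EXPERIMENT": {
                "@accession": "SRX9603371",
                "STUDY_REF": {"@accession": "SRP267191"},
            }
        }
    }
}
accessions = experiment_accessions(example)
print(accessions)
```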
In [26]:
sra_entry = get_sequence(sra_identifiers[1])
print(json.dumps(sra_entry, indent=3))

{
   "EXPERIMENT_PACKAGE_SET": {
      "EXPERIMENT_PACKAGE": {
         "EXPERIMENT": {
            "@accession": "SRX9603371",
            "@alias": "FL-QDX-1707",
            "IDENTIFIERS": {
               "PRIMARY_ID": "SRX9603371"
            },
            "TITLE": "SARS-CoV-2 patient specimen",
            "STUDY_REF": {
               "@accession": "SRP267191",
               "IDENTIFIERS": {
                  "PRIMARY_ID": "SRP267191",
                  "EXTERNAL_ID": {
                     "@namespace": "BioProject",
                     "#text": "PRJNA631061"
                  }
               }
            },
            "DESIGN": {
               "DESIGN_DESCRIPTION": "reverse transcription; ARTIC amplicon; sequencing; bwa; iVar",
               "SAMPLE_DESCRIPTOR": {
                  "@accession": "SRS7807084",
                  "IDENTIFIERS": {
                     "PRIMARY_ID": "SRS7807084",
                     "EXTERNAL_ID": {
                        "@namespace": "B