## Before you run this notebook

Start a terminal in Jupyter and enter the following commands:

```bash
git clone https://github.com/NCBI-Codeathons/pubmed-codeathon-team4.git
cd pubmed-codeathon-team4
pip install -r requirements.txt
```

In [1]:
from bmcodeathon.team4 import EUtils, print_element

### Warning
This code needs an NCBI API key, which is associated with an email address. Please replace the key and email address in the next cell with your own. I will change my key after the codeathon.
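One way to avoid pasting a personal key into the notebook is to read it from the environment. The sketch below is only an illustration: the function `load_ncbi_credentials` and the variable names `NCBI_API_KEY` and `NCBI_EMAIL` are assumptions made for this example and are not part of the `bmcodeathon` package.

```python
import os


def load_ncbi_credentials(environ=os.environ):
    """Return (api_key, email) read from environment variables.

    NCBI_API_KEY and NCBI_EMAIL are hypothetical names chosen for this sketch.
    """
    api_key = environ.get("NCBI_API_KEY")
    email = environ.get("NCBI_EMAIL")
    if not api_key or not email:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("Set NCBI_API_KEY and NCBI_EMAIL before running the notebook")
    return api_key, email
```

The two returned values could then be passed as the first two arguments of `EUtils(...)` in place of the literal strings.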

In [2]:
eutils = EUtils(
    '8d4c4f67f2a663e9d0ef6ed4d60a4eedd609',               # API key
    'dansmood@gmail.com',                                 # Email address - unused
    10,                                                   # API calls per second
    'https://eutilspreview.ncbi.nlm.nih.gov/entrez'       # URL prefix for preview - normally not needed
)
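The third argument above is described as the number of API calls per second. How `EUtils` enforces that limit is not shown in this notebook, so the following is only a sketch of one possible approach, with a hypothetical `RateLimiter` class that spaces calls evenly. The `clock` and `sleep` parameters are there so the class can be exercised without real waiting.

```python
import time


class RateLimiter:
    """Space calls at least 1/calls_per_second seconds apart (illustrative only)."""

    def __init__(self, calls_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `limiter.wait()` before each request would keep a loop of requests at or below the configured rate.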

In [3]:
eutils.apikey

'8d4c4f67f2a663e9d0ef6ed4d60a4eedd609'

### Summary of API style
This utility uses the Python library [requests](https://requests.readthedocs.io/en/latest/).
As a design choice, the API returns the `requests` response objects, but adds methods to each
response to convert its content.
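To make the design choice concrete, here is a minimal sketch of a response object that gains a conversion method. It is not the actual `bmcodeathon` implementation: `ConvertingResponse` and `FakeResponse` are hypothetical names, and the sketch uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs offline without extra dependencies.

```python
import xml.etree.ElementTree as ET


class ConvertingResponse:
    """Wrap a response-like object and add a content-conversion method."""

    def __init__(self, response):
        self._response = response

    def __getattr__(self, name):
        # Delegate everything else (status_code, headers, ...) to the wrapped object.
        return getattr(self._response, name)

    def xml(self):
        """Parse the body as XML, but only if the server says it is XML."""
        content_type = self._response.headers.get("Content-Type", "")
        if not content_type.startswith("text/xml"):
            raise ValueError(f"Not an XML response: {content_type!r}")
        return ET.fromstring(self._response.content)


class FakeResponse:
    """Stand-in for a requests response so the sketch runs without a network."""

    status_code = 200
    headers = {"Content-Type": "text/xml; charset=UTF-8"}
    content = b"<eInfoResult><DbInfo><DbName>pubmed</DbName></DbInfo></eInfoResult>"


r = ConvertingResponse(FakeResponse())
db_name = r.xml().findtext("DbInfo/DbName")
```

The caller still sees the usual response attributes such as `status_code`, and the conversion step is one extra method call.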

In [4]:
r = eutils.einfo('pubmed')
r.status_code

200

So, if the "Content-Type" response header indicates XML, you can use the `xml` method.

In [5]:
r.headers['Content-Type'].startswith('text/xml')

True

In [6]:
r.xml()

<lxml.etree._ElementTree at 0x7f143800cd80>

Unfortunately, I don't know how to render that as a browsable XML element. The best I know how to do is print it to the console.

In [7]:
print_element(r.xml())

<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">
<eInfoResult>
	<DbInfo>
    <DbName>pubmed</DbName>
    <MenuName>PubMed</MenuName>
    <Description>PubMed bibliographic record</Description>
    <DbBuild>Build-2022.05.24.11.43</DbBuild>
    <Count>34130421</Count>
    <LastUpdate>2022/05/24 11:43</LastUpdate>
    <FieldList>
      <Field>
        <Name>ALL</Name>
        <FullName>All Fields</FullName>
        <Description>All terms from all searchable fields</Description>
        <TermCount/>
        <IsDate>N</IsDate>
        <IsNumerical>N</IsNumerical>
        <SingleToken>N</SingleToken>
        <Hierarchy>N</Hierarchy>
        <IsHidden>N</IsHidden>
      </Field>
      <Field>
        <Name>UID</Name>
        <FullName>UID</FullName>
        <Description>Unique number assigned to publication</Description>
        <TermCount/>
        <IsDate>N</IsDate>
        <IsNumerical>Y</IsNumerical>
        <Sin

### esearch
Let's run a search with this utility, sorted by relevance.

In [8]:
r = eutils.esearch('pubmed', term='African Americans', sort='relevance')
r.raise_for_status()
assert r.headers['Content-Type'].startswith('text/xml')
doc = r.xml()
print_element(doc)

<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult>
  <Count>71574</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <IdList>
<Id>9366634</Id>
<Id>31465680</Id>
<Id>30585909</Id>
<Id>10540593</Id>
<Id>33617701</Id>
<Id>25754929</Id>
<Id>25612227</Id>
<Id>32451221</Id>
<Id>29340703</Id>
<Id>23197118</Id>
<Id>26980862</Id>
<Id>27792475</Id>
<Id>33617706</Id>
<Id>27193774</Id>
<Id>15073466</Id>
<Id>32583690</Id>
<Id>26018864</Id>
<Id>15303080</Id>
<Id>33704773</Id>
<Id>12623690</Id>
</IdList>
  <TranslationSet>
    <Translation>     <From>African Americans</From>     <To>"african americans"[MeSH Terms] OR ("african"[All Fields] AND "americans"[All Fields]) OR "african americans"[All Fields]</To>    </Translation>
  </TranslationSet>
  <QueryTranslation>"african americans"[MeSH Terms] OR ("african"[All Fields] AND "americans"[All Fields]) OR "african americans"[All Fields]</QueryTranslat

### Extracting the PMIDs

The `doc` returned from `r.xml()` is an lxml ElementTree, and so you can use XPath expressions on it to get Python lists of elements.

In [9]:
pmids = [element.text for element in doc.xpath('//IdList/Id')]
pmids

['9366634',
 '31465680',
 '30585909',
 '10540593',
 '33617701',
 '25754929',
 '25612227',
 '32451221',
 '29340703',
 '23197118',
 '26980862',
 '27792475',
 '33617706',
 '27193774',
 '15073466',
 '32583690',
 '26018864',
 '15303080',
 '33704773',
 '12623690']
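The extraction above can also be tried offline on a small hand-written document. This sketch uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath through `findall`, rather than lxml. The sample XML is made up for the example and only reuses three of the IDs printed above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an eSearchResult; not real API output.
SAMPLE = """<eSearchResult>
  <Count>3</Count>
  <IdList>
    <Id>9366634</Id>
    <Id>31465680</Id>
    <Id>30585909</Id>
  </IdList>
</eSearchResult>"""

root = ET.fromstring(SAMPLE)

# Paths are relative to the root element, so start from "./IdList".
sample_pmids = [element.text for element in root.findall("./IdList/Id")]

# Element text is always a string; convert explicitly when a number is needed.
total_hits = int(root.findtext("Count"))
```

Note that the PMIDs come back as strings, which is the form needed when passing them on to another E-Utilities call.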

### efetch

Now we can fetch these records with efetch. The API takes the identifiers through the `*args` argument of `efetch`, which makes it easy to fetch just one. If you are unfamiliar with Python, note how to pass a list you already have: prefix the argument with a `*`.
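For readers new to `*args`, the sketch below shows the calling convention in isolation. `build_efetch_params` is a hypothetical helper written for this example, not the library's `efetch`. It relies only on the fact that E-Utilities accepts several identifiers as a comma-separated `id` parameter.

```python
def build_efetch_params(db, *ids, **params):
    """Collect positional identifiers into the comma-separated `id` parameter."""
    query = {"db": db, "id": ",".join(str(i) for i in ids)}
    query.update(params)  # extra keyword arguments such as rettype
    return query


# A single identifier is passed directly.
single = build_efetch_params("pubmed", "9366634")

# An existing list is unpacked into separate positional arguments with "*".
pmid_list = ["9366634", "31465680"]
several = build_efetch_params("pubmed", *pmid_list, rettype="xml")
```

Without the `*`, the whole list would arrive as one positional argument instead of one argument per identifier.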

In [10]:
r = eutils.efetch('pubmed', *pmids, rettype='xml')
r.raise_for_status()
assert r.headers['Content-Type'].startswith('text/xml')
doc = r.xml()
print_element(doc)

<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
      <PMID Version="1">9366634</PMID>
      <DateCompleted>
        <Year>1997</Year>
        <Month>12</Month>
        <Day>04</Day>
      </DateCompleted>
      <DateRevised>
        <Year>2022</Year>
        <Month>04</Month>
        <Day>10</Day>
      </DateRevised>
      <Article PubModel="Print">
        <Journal>
          <ISSN IssnType="Print">0090-0036</ISSN>
          <JournalIssue CitedMedium="Print">
            <Volume>87</Volume>
            <Issue>11</Issue>
            <PubDate>
              <Year>1997</Year>
              <Month>Nov</Month>
            </PubDate>
          </JournalIssue>
          <Title>American journal of public health</Title>
          <ISOAbbreviation>Am J Public Health</ISOAbbreviation>
        </Journal>
        

Now we use XPath again to verify that E-Utilities returned exactly what we asked for.

In [11]:
returned_pmids = [element.text for element in doc.xpath('//PubmedArticle/MedlineCitation/PMID')]
pmids == returned_pmids

True
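When the comparison is `False`, it helps to know which identifiers differ. The following is a small sketch with a hypothetical helper, `compare_pmids`, that reports missing and unexpected IDs. The input lists in the example are illustrative and are not output from a real request.

```python
def compare_pmids(requested, returned):
    """Summarize how the returned PMIDs differ from the requested ones."""
    requested_set, returned_set = set(requested), set(returned)
    return {
        "same_order": list(requested) == list(returned),
        "missing": sorted(requested_set - returned_set),
        "unexpected": sorted(returned_set - requested_set),
    }


# Illustrative input: one requested PMID is absent from the returned list.
report = compare_pmids(
    ["9366634", "31465680", "30585909"],
    ["9366634", "30585909"],
)
```

Here `report["missing"]` lists the requested PMIDs that did not come back, which is usually the first thing to check.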