# Cheat Sheet 3

## Exercise 1
In this first exercise, we're tasked with using the requests module to find some papers on PubMed through the Entrez API. This requires several modules, so let's start by importing them:

In [26]:
import requests
import xml.etree.ElementTree as ET
import time
import itertools

Let's find and parse some data about COVID-19 articles in this way. First, we have to build our URL according to the Entrez API specifications:

In [20]:
search_term = "Coronavirus"
year = 2020
retmax = 20
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
parameters = f"?db=pubmed&retmax={retmax}&term={search_term}+AND+{year}[pdat]"
url = base_url + parameters

The base URL specifies the website and endpoint we would like to request our data from. The parameters tell this endpoint exactly what we're looking for: which database we'd like to access (db=pubmed), how many articles we'd like returned (retmax=20), and which articles we'd like to search for (term=Coronavirus+AND+2020[pdat], that is, articles matching "Coronavirus" with a publication date in 2020).
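As an alternative to assembling the query string by hand, the same URL can be built from a dictionary. The sketch below uses urlencode() from the standard library, which escapes spaces and brackets for us; the variable name params is only illustrative.

```python
from urllib.parse import urlencode

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The same search parameters as above, written as a dictionary.
params = {
    "db": "pubmed",
    "retmax": 20,
    "term": "Coronavirus AND 2020[pdat]",
}

# urlencode() turns spaces into "+" and percent-encodes the square brackets.
url = base_url + "?" + urlencode(params)
print(url)
```

The requests module can also do this step itself if the dictionary is passed along with the base URL, as in requests.get(base_url, params=params).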

In [21]:
r = requests.get(url)

The function requests.get() returns the server's response to our request.
Let's have a look at what this response is:

In [22]:
print(r)

<Response [200]>


Printing r directly to the console shows a response object with the code 200. This is an HTTP response status code, which tells us whether or not our request was successful. The code 200 means "OK", which is a good sign that our request went through. You can find what other codes mean [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), just in case you get something else.
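If you'd rather look up a status code from within Python, the standard library's http module knows the standard reason phrases. The helper below is a small sketch; describe_status is a hypothetical name, not part of requests.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK' for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code} (unrecognized status code)"

for code in (200, 404, 429, 500):
    print(describe_status(code))
```

On a real response object, r.ok is True for successful requests and r.raise_for_status() raises an exception for 4xx and 5xx codes, which is a convenient way to stop early when something went wrong.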

Now that we have a response object, we can start extracting information. For example, we can get the status code from the object's properties:

In [23]:
r.status_code

200

We can also get the content of the response using the following:

In [24]:
content = r.text
print(content)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>71924</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>35516164</Id>
<Id>35512053</Id>
<Id>35512041</Id>
<Id>35512040</Id>
<Id>35512038</Id>
<Id>35484870</Id>
<Id>33117894</Id>
<Id>35349206</Id>
<Id>35330938</Id>
<Id>35169637</Id>
<Id>35154373</Id>
<Id>35145562</Id>
<Id>35141720</Id>
<Id>35141430</Id>
<Id>35137816</Id>
<Id>35126007</Id>
<Id>35125998</Id>
<Id>35097224</Id>
<Id>35097223</Id>
<Id>35090689</Id>
</IdList><TranslationSet><Translation>     <From>Coronavirus</From>     <To>"coronavirus"[MeSH Terms] OR "coronavirus"[All Fields]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"coronavirus"[MeSH Terms]</Term>    <Field>MeSH Terms</Field>    <Count>137184</Count>    <Explode>Y</Explode>   </TermSet>   <TermSet>    <Term>"coronavirus"[All Fields]</

The printout above shows that our data is formatted as an XML string. We can use the xml.etree.ElementTree module to parse this XML string into an element tree and extract our desired PubMed IDs (see Cheat Sheet 2 if you need a refresher on parsing XML):

In [34]:
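# ET.fromstring() parses the XML string and returns the root element directly
root = ET.fromstring(content)
# Collect the text of every <Id> element, wherever it appears in the tree
ids = [id_element.text for id_element in root.iter('Id')]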

In [35]:
print(ids)

['35516164', '35512053', '35512041', '35512040', '35512038', '35484870', '33117894', '35349206', '35330938', '35169637', '35154373', '35145562', '35141720', '35141430', '35137816', '35126007', '35125998', '35097224', '35097223', '35090689']


You may find it useful to adapt the code above into a function that returns PubMed IDs based on search parameters. This will help in case you have to search for several different topics.
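One possible shape for such a function is sketched below. The names build_search_url, parse_ids and get_pubmed_ids are hypothetical, and the search is split into small steps so that the URL building and XML parsing can be tried out without contacting the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_search_url(term, year, retmax=20):
    """Build an esearch URL for a term restricted to one publication year."""
    params = {"db": "pubmed", "retmax": retmax, "term": f"{term} AND {year}[pdat]"}
    return ESEARCH_URL + "?" + urlencode(params)

def parse_ids(xml_text):
    """Return the text of every <Id> element in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_element.text for id_element in root.iter("Id")]

def get_pubmed_ids(term, year, retmax=20):
    """Search PubMed and return a list of ID strings (makes a network request)."""
    # Imported here so the two helpers above can be used without requests.
    import requests
    r = requests.get(build_search_url(term, year, retmax))
    r.raise_for_status()  # stop early on 4xx/5xx responses
    return parse_ids(r.text)
```

A call such as get_pubmed_ids("Coronavirus", 2020) would then reproduce the search from this exercise.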

Now that we have our paper IDs, we will need to make another request to **a different endpoint** to get some metadata back. We can ask for the metadata of multiple papers at once by setting the id search parameter to a list of PubMed IDs separated by commas. The Entrez API only allows URLs of limited length, so you may have to request papers in smaller batches. If this is the case, you **absolutely must** space out the requests you send to the server using the time module, or else the server may block your IP address from accessing the data.
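The batching described above could be sketched as follows. The helper names in_batches and fetch_metadata are hypothetical, and the batch size of 100 and the one-second pause are placeholder assumptions rather than official limits, so check NCBI's current usage guidelines before relying on them.

```python
import itertools
import time

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def in_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def fetch_metadata(ids, batch_size=100, delay=1.0):
    """Request metadata batch by batch, pausing between requests.

    Returns one XML string per batch (makes network requests).
    """
    # Imported here so in_batches() can be used without requests.
    import requests
    responses = []
    for i, batch in enumerate(in_batches(ids, batch_size)):
        if i > 0:
            time.sleep(delay)  # pause before every request except the first
        params = {"db": "pubmed", "retmode": "xml", "id": ",".join(batch)}
        r = requests.get(EFETCH_URL, params=params)
        r.raise_for_status()
        responses.append(r.text)
    return responses
```

With only 20 IDs, a single request is enough, which is what the cells below do.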

In [37]:
id_string = ",".join(ids)

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
parameters =f"?db=pubmed&retmode=xml&id={id_string}"

url = base_url + parameters

In [40]:
r = requests.get(url)

In [44]:
print(r.text[0:1000])

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet><PubmedArticle><MedlineCitation Status="MEDLINE" IndexingMethod="Automated" Owner="NLM"><PMID Version="1">35516164</PMID><DateCompleted><Year>2022</Year><Month>05</Month><Day>09</Day></DateCompleted><DateRevised><Year>2022</Year><Month>05</Month><Day>09</Day></DateRevised><Article PubModel="Electronic-eCollection"><Journal><ISSN IssnType="Electronic">2399-4908</ISSN><JournalIssue CitedMedium="Internet"><Volume>5</Volume><Issue>4</Issue><PubDate><Year>2020</Year></PubDate></JournalIssue><Title>International journal of population data science</Title><ISOAbbreviation>Int J Popul Data Sci</ISOAbbreviation></Journal><ArticleTitle>Estimating surge in COVID-19 cases, hospital resources and PPE demand with the interactive and locally-informed <i>COVID-19 Health System Capacity Planning Tool</i>.</ArticleTitle><
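As one example of what can be pulled out of this response, the sketch below extracts each article's PubMed ID and title. It runs on a heavily shortened copy of the record printed above, and extract_titles is a hypothetical helper name. Note that the title contains an `<i>` tag, so reading only the element's .text would cut the title off at that point; joining itertext() keeps the whole thing.

```python
import xml.etree.ElementTree as ET

# A shortened version of the first record in the response above.
sample = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    '<PMID Version="1">35516164</PMID>'
    "<Article><ArticleTitle>Estimating surge in COVID-19 cases, hospital "
    "resources and PPE demand with the interactive and locally-informed "
    "<i>COVID-19 Health System Capacity Planning Tool</i>.</ArticleTitle>"
    "</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)

def extract_titles(xml_text):
    """Map each article's PubMed ID to its full title."""
    root = ET.fromstring(xml_text)
    titles = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title_element = article.find("MedlineCitation/Article/ArticleTitle")
        if title_element is not None:
            # itertext() also walks nested tags such as <i>...</i>.
            titles[pmid] = "".join(title_element.itertext())
    return titles

print(extract_titles(sample))
```

With the real response, the same function could be called as extract_titles(r.text).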

Now it's up to us to 

## Exercise 2

## Exercise 3

## Exercise 4

## Exercise 5