# Cheat Sheet 3

---

## Exercise 1
### Requests Module & The Entrez API
In this first exercise, we're tasked with using the requests module to find some papers on PubMed through the Entrez API. This process will require several modules, so let's start by importing them:

In [75]:
import requests
import xml.etree.ElementTree as ET
import time
import itertools
from pprint import pprint as pp

Let's find and parse some data about articles on a topic of interest (here, stroke) in this way. First we have to build our URL according to the Entrez API specifications:

In [45]:
search_term = "stroke"
year = 2020
retmax = 20
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
parameters = f"?db=pubmed&retmax={retmax}&term={search_term}+AND+{year}[pdat]"
url = base_url + parameters

The base URL specifies the website and endpoint we would like to request our data from. The parameters tell this endpoint exactly what we're looking for, including which database we'd like to access (db=pubmed), how many articles we'd like to see (retmax=20), and which articles we'd like to search for (term=stroke+AND+2020[pdat]).

In [46]:
r = requests.get(url)

The function requests.get() should return the server's response to your request.
Let's have a look at what this response is:

In [52]:
print(r)

<Response [200]>


In [68]:
# add dir? help? attr?

Printing r directly to the console gives us a terse description of a response object with code 200. This is an HTTP response status code, which tells us whether or not our request was successful. The code 200 means "OK", which is a good sign that our request went through. You can find what other codes mean [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), just in case you get something else.

Now that we have a response object, we can start extracting information. For example, we can get the status code from the object's properties:

In [48]:
r.status_code

200

We can also get the content of the response using the following:

In [49]:
content = r.text
print(content)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>30633</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>35353466</Id>
<Id>35342805</Id>
<Id>35279226</Id>
<Id>35130113</Id>
<Id>35074187</Id>
<Id>35002168</Id>
<Id>34950386</Id>
<Id>34950327</Id>
<Id>34898801</Id>
<Id>34888207</Id>
<Id>34795974</Id>
<Id>34765407</Id>
<Id>34752535</Id>
<Id>34732927</Id>
<Id>34728943</Id>
<Id>34720138</Id>
<Id>34713244</Id>
<Id>34713060</Id>
<Id>34695217</Id>
<Id>34654533</Id>
</IdList><TranslationSet><Translation>     <From>stroke</From>     <To>"stroke"[MeSH Terms] OR "stroke"[All Fields]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"stroke"[MeSH Terms]</Term>    <Field>MeSH Terms</Field>    <Count>159133</Count>    <Explode>Y</Explode>   </TermSet>   <TermSet>    <Term>"stroke"[All Fields]</Term>    <Field>All Field

The printout above shows that our data is formatted as an XML string. We can use the xml.etree.ElementTree module to parse this XML string into an element tree and extract our desired PubMed IDs (see Cheat Sheet 2 if you need a refresher on parsing XML):

In [50]:
tree = ET.ElementTree(ET.fromstring(content))
root = tree.getroot()
ids = [Id.text for Id in root.iter('Id')]

In [51]:
print(ids)

['35353466', '35342805', '35279226', '35130113', '35074187', '35002168', '34950386', '34950327', '34898801', '34888207', '34795974', '34765407', '34752535', '34732927', '34728943', '34720138', '34713244', '34713060', '34695217', '34654533']


You may find it useful to adapt the code above into a function that returns some PubMed IDs based on search parameters. This will help in case you have to search for several different topics.
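
One possible shape for such a function is sketched below. The names `build_esearch_url` and `extract_ids` are hypothetical, and the sketch uses `urllib.parse.urlencode` from the standard library to escape the query rather than assembling it by hand. A shortened copy of the response shown earlier stands in for `r.text`, so the example runs without contacting the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(search_term, year, retmax=20):
    """Return an esearch URL for `search_term` restricted to publication year `year`."""
    query = {
        "db": "pubmed",
        "retmax": retmax,
        "term": f"{search_term} AND {year}[pdat]",
    }
    # urlencode turns spaces into "+" and escapes characters such as "[" and "]"
    return f"{ESEARCH_URL}?{urlencode(query)}"

def extract_ids(xml_text):
    """Return the text of every <Id> element in an esearch XML response."""
    root = ET.fromstring(xml_text)
    return [id_element.text for id_element in root.iter("Id")]

# Shortened stand-in for r.text (only the first two IDs are kept)
sample_response = (
    "<eSearchResult><Count>30633</Count><IdList>"
    "<Id>35353466</Id><Id>35342805</Id>"
    "</IdList></eSearchResult>"
)

search_url = build_esearch_url("stroke", 2020)
sample_ids = extract_ids(sample_response)
print(search_url)
print(sample_ids)
```

With a live connection, you would pass `requests.get(search_url).text` to `extract_ids` instead of the sample string.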

Now that we have our paper IDs, we will need to make another request to **a different endpoint** to get some metadata back. We can ask for the metadata from multiple papers at once by setting the id search parameter to a collection of PubMed IDs separated by commas.

**WARNING: The Entrez API only allows a limited length of URL, so you may have to request papers in small batches. If this is the case, you absolutely must space out the requests you send to the server using the time module, or else they will revoke your IP address's access to data. Below, I'm using the time module to spread out multiple runs through the same loop:**

In [66]:
for i in range(10):
    time.sleep(1)
    print(i)

0
1
2
3
4
5
6
7
8
9


In [37]:
id_string = ",".join(ids)# joins list of ids to comma separated string

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
parameters =f"?db=pubmed&retmode=xml&id={id_string}"

url = base_url + parameters
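
If the list of IDs is too long for a single URL, one way to combine batching with the pause described in the warning is sketched here. The helper names (`chunk`, `build_efetch_urls`, `fetch_all`) and the batch size are assumptions made for illustration, not part of the Entrez API. The `fetch` argument is a stand-in so the example runs offline.

```python
import time

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def chunk(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def build_efetch_urls(pubmed_ids, batch_size=5):
    """Build one efetch URL per batch of PubMed IDs."""
    return [
        f"{EFETCH_URL}?db=pubmed&retmode=xml&id={','.join(batch)}"
        for batch in chunk(pubmed_ids, batch_size)
    ]

def fetch_all(urls, fetch, delay=1.0):
    """Call `fetch(url)` for each URL, sleeping `delay` seconds between calls."""
    results = []
    for position, batch_url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # pause before every request except the first
        results.append(fetch(batch_url))
    return results

example_ids = ["35353466", "35342805", "35279226"]
batch_urls = build_efetch_urls(example_ids, batch_size=2)

# Offline stand-in: return the id list from each URL instead of contacting the server.
fetched = fetch_all(batch_urls, fetch=lambda u: u.split("id=")[1], delay=0.1)
print(fetched)
```

To make real requests, you could pass something like `lambda u: requests.get(u).text` as `fetch` and keep the delay at a second or more.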

That warning aside, let's get some data from the efetch endpoint:

In [40]:
r = requests.get(url)

In [44]:
print(r.text[0:1000]) #only prints out the first thousand characters of our metadata

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet><PubmedArticle><MedlineCitation Status="MEDLINE" IndexingMethod="Automated" Owner="NLM"><PMID Version="1">35516164</PMID><DateCompleted><Year>2022</Year><Month>05</Month><Day>09</Day></DateCompleted><DateRevised><Year>2022</Year><Month>05</Month><Day>09</Day></DateRevised><Article PubModel="Electronic-eCollection"><Journal><ISSN IssnType="Electronic">2399-4908</ISSN><JournalIssue CitedMedium="Internet"><Volume>5</Volume><Issue>4</Issue><PubDate><Year>2020</Year></PubDate></JournalIssue><Title>International journal of population data science</Title><ISOAbbreviation>Int J Popul Data Sci</ISOAbbreviation></Journal><ArticleTitle>Estimating surge in COVID-19 cases, hospital resources and PPE demand with the interactive and locally-informed <i>COVID-19 Health System Capacity Planning Tool</i>.</ArticleTitle><

Again, our response text is just a string containing XML-formatted metadata. Since we have already gone over XML parsing, I will leave it to you to figure out the internal structure of the document and extract some information using the ElementTree module.

**Hint**: when parsing very large, complex, deeply nested XML responses such as this one, you might find it useful to write the XML data to a file and look at it in an editor with syntax highlighting to get a broad overview of its structure. This is especially useful when you're trying to write code that pulls specific information out of the XML.
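
A minimal sketch of that hint follows, using the standard library's `xml.dom.minidom` to add indentation before writing the file. The function name `save_pretty_xml`, the file name, and the tiny sample document are all made up for illustration. In practice you would pass in `r.text`.

```python
import os
import tempfile
import xml.dom.minidom

def save_pretty_xml(xml_text, path):
    """Write an indented copy of `xml_text` to `path` and return the indented text."""
    pretty = xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")
    with open(path, "w", encoding="utf-8") as xml_file:
        xml_file.write(pretty)
    return pretty

# Tiny stand-in for the efetch response text
sample_xml = (
    "<PubmedArticleSet><PubmedArticle>"
    "<PMID>35516164</PMID>"
    "</PubmedArticle></PubmedArticleSet>"
)

xml_path = os.path.join(tempfile.gettempdir(), "pubmed_sample.xml")
pretty_xml = save_pretty_xml(sample_xml, xml_path)
print(pretty_xml)
```

Each level of nesting now sits on its own indented line, which makes the hierarchy much easier to scan in an editor.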

---

### JavaScript Object Notation (JSON)

Like XML, JSON is a data format that is commonly used to pass around and store structured information. Like XML, JSON is flexible and capable of storing complex, nested information.

Here's an example of JSON:

```json
{
    "name": "Voldemort",
    "evil?": true,
    "birth year": 1926,
    "henchman": [
                     "Lucius Malfoy", 
                     "Severus Snape", 
                     "Belatrix Lestrange",
                     "Peter Pettigrew"
                ],
    "facial features":  {
                            "nose": null,
                            "skin": "ghastly"
                        }  
}
```

JSON has two basic structural components that allow us to organize this information: objects and arrays.

**Objects** are very similar in structure to Python dictionaries. Like dictionaries, objects store information using key-value pairs. While dictionary keys can be any hashable type, object keys must be strings. Object values, on the other hand, can be strings, numbers, booleans, nulls, arrays, and even other objects.

The syntax of an object is quite simple, and is described by the following rules:
1. Objects are delimited by curly brackets **{  }**.

2. An individual key-value pair is stored as **"key": value**, with a colon separating the key and value.

3. Key-value pairs are separated from each other by commas.

4. Whitespace and new lines do not affect the structure, so we can use them as we see fit for readability.

Putting these rules together, here's what a basic object looks like:

```json
{
    "key1": "value1",
    "key2": 2,
    "key3": true,
    "key4": null
}
```

---

**Arrays** are very similar in structure to Python lists. Like lists, they store an ordered collection of data across a range of integer indices. We often find that all the members of a given array are of the same type, but this is not always the case. Arrays can store strings, numbers, booleans, nulls, objects, and other arrays.

Array syntax is also very simple:
1. Arrays are delimited by square brackets **[   ]**

2. Elements inside an array are separated by commas

3. Whitespace and new lines do not affect the structure, so we can use them as we see fit for readability

Putting these rules together, here's a basic JSON array:

```json
    [
        1,
        2,
        3,
        4,
        5
    ]
```

A few more notes on JSON syntax:
1. Short objects and arrays can be written inline like so:
    - ```json 
        {"name": "tom", "age": 24}
      ```
    - ```json 
        [1, 2, 3]
      ```
2. For readability, larger objects and arrays should use line separation and tabbing:
    - ```json 
        {
            "name": "tom", 
            "age": 24,
            "hobbies": ["reading",
                        "long walks on the beach",
                        "cooking",
                        "pickleball"]
         }
      ```
3. JSON strings must be delimited with double quotes, whereas Python strings can use either double or single quotes.

---

Now that we understand what JSON is, how do we use it in Python? We'll start by importing Python's built-in json module:

In [69]:
import json

The json module automatically converts JSON structures to their Python equivalents and vice versa. Below is a conversion table that describes these data type equivalences:

|JSON | Python|
|-----|-------|
|object <br> {"hello": "world"}|dictionary <br> {'hello': "world"}|
|array <br> [1,2,3] |list <br> [1,2,3]|
|string<br>"mystring" | string <br><ul><li>"mystring"</li><li>'mystring'</li></ul>|
|number<br> 5| int <br> 5|
|number<br> 3.14 | float <br> 3.14|
|Boolean<br><ul><li>true</li><li>false</li></ul> | bool <br><ul><li>True</li><li>False</li></ul>|
|null| None|
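
As a quick check of the table, the short example below parses one value of each JSON type and reports the Python type it becomes. The key names are arbitrary. It also shows that text using single quotes is rejected, since JSON strings must use double quotes.

```python
import json

parsed = json.loads(
    '{"text": "hi", "whole": 5, "decimal": 3.14, '
    '"flag": true, "nothing": null, "items": [1, 2, 3]}'
)

# Map each key to the name of the Python type its value was converted to
type_names = {key: type(value).__name__ for key, value in parsed.items()}
print(type(parsed).__name__, type_names)

# Single-quoted text is valid Python but not valid JSON
try:
    json.loads("{'name': 'tom'}")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False
print(single_quotes_accepted)
```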

JSON is often stored and passed as plain text. Here we will use the json module to parse the plain text JSON object at the top of the section:

In [77]:
# triple-quoted strings like this let us keep the indentation and line breaks inside a string.
json_text = '''{
    "name": "Voldemort",
    "evil?": true,
    "birth year": 1926,
    "henchman": [
                     "Lucius Malfoy", 
                     "Severus Snape", 
                     "Belatrix Lestrange",
                     "Peter Pettigrew"
                ],
    "facial features":  {
                            "nose": null,
                            "skin": "ghastly"
                        }  
}
'''


json_dict = json.loads(json_text)
pp(json_dict)

{'birth year': 1926,
 'evil?': True,
 'facial features': {'nose': None, 'skin': 'ghastly'},
 'henchman': ['Lucius Malfoy',
              'Severus Snape',
              'Belatrix Lestrange',
              'Peter Pettigrew'],
 'name': 'Voldemort'}


As we can see, the json.loads() function will parse a string containing JSON into a nested structure of Python dictionaries and lists. To go in the other direction, we use the json.dumps() function. This direction requires an extra choice from the programmer: how do we want to format the text? There's not necessarily one right answer. Just pick formatting parameters that make sense for you or your team.

In [81]:
new_text = json.dumps(json_dict, indent=4, sort_keys=True)
print(new_text)

{
    "birth year": 1926,
    "evil?": true,
    "facial features": {
        "nose": null,
        "skin": "ghastly"
    },
    "henchman": [
        "Lucius Malfoy",
        "Severus Snape",
        "Belatrix Lestrange",
        "Peter Pettigrew"
    ],
    "name": "Voldemort"
}


Once you have collected, processed, and formatted large quantities of data from an API, you should probably save that data for later use. This way, you don't have to bother the API for the same data more than once (very important), and you don't have to process it again. As an added bonus, this makes the data pipeline more **modular**, which helps break the problem up into separate steps. We can write JSON to a file by using the json.dump() function:

In [83]:
with open("./voldemort.json", "w") as json_file:
    json.dump(json_dict, json_file)

Check to see that "voldemort.json" has been added to the folder containing this file.

When we're ready to come back and use our data, we can read the file using the json.load() function:

In [85]:
with open("./voldemort.json", "r") as json_file:
    dict_from_file = json.load(json_file)
    pp(dict_from_file)

{'birth year': 1926,
 'evil?': True,
 'facial features': {'nose': None, 'skin': 'ghastly'},
 'henchman': ['Lucius Malfoy',
              'Severus Snape',
              'Belatrix Lestrange',
              'Peter Pettigrew'],
 'name': 'Voldemort'}


WARNING: it is easy to confuse the purpose of json.load() versus json.loads() and json.dump() versus json.dumps(). I find it helpful to pretend that the "S" in "loads" and "dumps" stands for "String", so these functions convert directly between **string** objects and dictionaries. By process of elimination, we can conclude that load and dump do not convert directly to strings, so they must interact with files instead.
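
To see the four functions side by side, here is a small round trip. It uses `io.StringIO` as a stand-in for an open file so nothing is written to disk. The `record` dictionary is just sample data.

```python
import io
import json

record = {"name": "tom", "age": 24}

# dumps / loads: Python object <-> string
as_text = json.dumps(record)
from_text = json.loads(as_text)

# dump / load: Python object <-> file-like object
buffer = io.StringIO()
json.dump(record, buffer)
buffer.seek(0)  # rewind so json.load reads from the start
from_file = json.load(buffer)

print(as_text)
print(from_text == from_file == record)
```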

## Exercise 2

By now, we have gone over most of the tools you need to complete this problem on your own. If you get stuck, refer back to the section on basic Python and pandas from Cheat Sheet 1.

---

## Exercise 3

Now we are going to get into the nitty gritty of machine learning. As you may know, machine learning is broadly used to describe algorithms that iteratively improve some model using data. As the model consumes more data, it tends to become more accurate, so we say that the machine "learns". This technique can take many different forms and create models for virtually any imaginable purpose. Roughly speaking, there are two coarse categories that we can separate machine learning algorithms into: supervised and unsupervised. 

**A supervised learning algorithm** learns by training on labeled data, which is composed of many input-output pairs. At each iteration, the model takes an input and makes a prediction of what the output should be. The learning algorithm then compares this predicted output with the actual output that was paired with the input in the labeled data set. Based on this comparison, the learning algorithm adjusts the model's internal parameters slightly in whatever direction would best move the prediction toward the actual output. By iterating over this process many times, the model gradually makes more and more accurate predictions. Thus supervised learning tends to be used to generate predictive models.
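
The loop described above can be illustrated with a deliberately tiny model: a single parameter `w` that predicts `w * x`. This is only a toy sketch. The function name, learning rate, and data are assumptions chosen for illustration, and real models have many more parameters.

```python
def train_single_weight(pairs, learning_rate=0.01, epochs=200):
    """Fit y ≈ w * x by nudging w after every labeled (x, y) pair."""
    w = 0.0
    for _ in range(epochs):
        for x, y in pairs:
            prediction = w * x             # model makes a prediction
            error = prediction - y         # compare with the labeled output
            w -= learning_rate * error * x # adjust w to shrink the error
    return w

# Labeled data generated by the rule y = 2 * x
labeled_pairs = [(1, 2), (2, 4), (3, 6), (4, 8)]
learned_w = train_single_weight(labeled_pairs)
print(round(learned_w, 4))
```

After enough passes over the data, `learned_w` settles very close to 2, the value that generated the labels.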

**An unsupervised learning algorithm** learns by training on unlabeled data. Generally, these models attempt to determine some salient property of the dataset. The type of property we want to extract, and the algorithm we choose to extract it, will largely depend on our use case. For example, if we want to discover naturally occurring clusters in our data, we might use the k-means algorithm. If we want to simplify a dataset with many variables, we might instead choose truncated singular value decomposition (SVD) to pick out features that best explain variability.
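
As a rough illustration of clustering without labels, here is a toy one-dimensional version of k-means. The function name, the starting centers, and the data are made up, and a real project would more likely rely on a library implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

final_centers, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(final_centers)
print(final_clusters)
```

No labels are supplied anywhere. The two groups emerge purely from how the values are spread out.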

### Transformers

---

**Transformers** are a special type of artificial neural network that processes sequential data such as text, images, or video. Like all neural networks, transformers are made up of layers of **neurons**, which are linked up by weighted connections. Also like most neural networks, transformers are primarily trained using **backpropagation**, typically in a supervised fashion, which iteratively tunes the weight of each connection starting from the output layer and working backwards to the input layer. In particular, transformers have an internal mechanism called self-attention that allows them to decide which parts of the data are most important, and to give those parts more weight in determining the output.

Transformers are often used to embed human-readable text into high-dimensional vectors, which encode semantic information in geometric and algebraic relationships. For example, in the embedding given by word2vec (an earlier embedding model that is not itself a transformer), we might find that "King" - "Man" + "Woman" ≈ "Queen", or that "Actor" - "Talent" + "Ego" ≈ "Jay Leto".
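
The arithmetic in that example can be tried out with hand-made vectors. The two-dimensional "embeddings" below are invented purely for illustration (real embeddings have hundreds of dimensions and are learned from data), and the helper names are hypothetical.

```python
import math

# Invented 2-D vectors: first axis loosely "royalty", second axis loosely "gender"
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.2),
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(start, minus, plus):
    """Return the word closest to start - minus + plus, excluding the three inputs."""
    target = tuple(
        s - m + p
        for s, m, p in zip(toy_vectors[start], toy_vectors[minus], toy_vectors[plus])
    )
    candidates = [word for word in toy_vectors if word not in (start, minus, plus)]
    return max(candidates, key=lambda word: cosine_similarity(toy_vectors[word], target))

result = analogy("king", "man", "woman")
print(result)
```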

![](embedding.png)

In natural language processing, we will often use a transformer to map text to vectors as the first step in a larger model-training pipeline. In this exercise, you are tasked with using a pretrained transformer, SPECTER, to embed metadata text before using that embedding to train two other models.

In [None]:
from transformers import

### Principal Component Analysis

---

### Linear Discriminant Analysis

---

---

## Exercise 4

Here you are tasked with describing how to parallelize the merge sort algorithm using two processes. Before you try to parallelize it, let's try to get a good understanding of how merge sort works in general. Suppose you would like to sort the following list of values into ascending order:

```json
[5, 7, 6, 1]
```

We start by splitting the list into two equal halves, effectively by making a copy of each half:

```json
[5, 7, 6, 1]
```
$\hspace{3.5em}\swarrow\searrow$
```json
[[5,7],[6,1]]
```

We continue to recursively break up each sub-list one at a time until we encounter two lists containing at most one element each. Breaking up the first sublist above, we arrive at the following:

```json
[5, 7]
```

$\hspace{2em}\swarrow\searrow$

```json
[[5],[7]]
```

Now that we have two sublists of size 1, it is time to merge. We accomplish this by comparing the first element from each sublist. Whichever is smaller is removed from its split sublist and appended to the end of the merged list.

```json
{
    split: [[5],[7]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 5 is removed from split and added to merged

```json
{
    split: [[],[7]],
    merged: [5]
}
```

Now we try to compare the first element of each of the split sublists again. Since the first sublist has run out of elements, there is nothing left to compare, so the rest of the second sublist goes into the merged list. The smallest element of the second sublist has already been compared with every element of the merged list, so we can conclude that the second sublist can simply be added to the end of the merged list:

```json
{
    split: [[],[7]],
    merged: [5]
}
```

$\hspace{4em}\Downarrow$ remaining elements of our second sublist are appended to the merged list

```json
{
    split: [[],[]],
    merged: [5,7]
}
```

Now that the first pair of single-element sublists are merged, we rinse and repeat with the next sublist:

### Split
```json
[6, 1]
```

$\hspace{2em}\swarrow\searrow$

```json
[[6],[1]]
```

### Merge

```json
{
    split: [[6],[1]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 1 is removed from split and added to merged

```json
{
    split: [[6],[]],
    merged: [1]
}
```

$\hspace{4em}\Downarrow$ remaining elements of our first sublist are appended to the merged list

```json
{
    split: [[],[]],
    merged: [1,6]
}
```


Now all of the single element sublists have been merged into sorted lists. We start merging these larger lists, which is only slightly more complicated:

```json
{
    split: [[5,7],[1,6]],
    merged: []
}
```

We compare the first element of each split sublist. Whichever is smaller goes in the merged sublist first:

```json
{
    split: [[5,7],[1,6]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 1 is removed from split and added to merged

```json
{
    split: [[5,7],[6]],
    merged: [1]
}
```

Now the second split sublist has a new smallest element, so we have to make the comparison again.

```json
{
    split: [[5,7],[6]],
    merged: [1]
}
```

$\hspace{4em}\Downarrow$ 5 is removed from split and added to merged

```json
{
    split: [[7],[6]],
    merged: [1,5]
}
```

Repeating this step one more time, we get:

```json
{
    split: [[7],[6]],
    merged: [1,5]
}
```
$\hspace{4em}\Downarrow$ 6 is removed from split and added to merged

```json
{
    split: [[7],[]],
    merged: [1,5,6]
}
```

Again, one of our split sublists has run out of elements, so we have to merge the rest of the other sublist into the merged list. Since the smallest element of this split sublist has already been compared with the largest element of the merged sublist, every element of the split sublist is larger than every element of the merged sublist. Hence, we just append all the elements from the split sublist to the end of the merged sublist:

```json
{
    split: [[7],[]],
    merged: [1,5,6]
}
```

$\hspace{4em}\Downarrow$ remaining elements of split sublist appended to the end of merged

```json
{
    split: [[],[]],
    merged: [1,5,6,7]
}
```
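
The split and merge steps traced above can be written as a short sequential Python sketch. The function names are chosen here for illustration, and the parallel version is left to you, as the exercise asks.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly move the smaller front element into the merged list
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side has run out, so append whatever remains of the other
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def merge_sort(values):
    """Return a sorted copy of `values` using recursive splitting and merging."""
    if len(values) <= 1:
        return list(values)
    middle = len(values) // 2
    return merge(merge_sort(values[:middle]), merge_sort(values[middle:]))

sorted_values = merge_sort([5, 7, 6, 1])
print(sorted_values)  # prints [1, 5, 6, 7]
```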

And so our list is sorted, and our task is done. Note that the 

## Exercise 5