# Cheat Sheet 3

---

## Exercise 1
### Requests Module & The Entrez API
In this first exercise, we're tasked with using the requests module to find some papers on PubMed through the Entrez API. This process requires several modules, so let's start by importing them:

In [1]:
import requests
import xml.etree.ElementTree as ET
import time
import itertools
from pprint import pprint as pp

Let's find and parse some data about stroke articles in this way. First we have to build our URL according to the Entrez API specifications:

In [2]:
search_term = "stroke"
year = 2020
retmax = 20
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
parameters = f"?db=pubmed&retmax={retmax}&term={search_term}+AND+{year}[pdat]"
url = base_url + parameters

The base URL specifies the website and endpoint we would like to request our data from. The parameters tell this endpoint exactly what we're looking for: which database we'd like to access (db=pubmed), how many articles we'd like to see (retmax=20), and which articles we'd like to search for (term=stroke+AND+2020[pdat]).
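As a small aside, the same query string can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`, which helps avoid typos in hand-built parameter strings. This is only a sketch: the `params` and `query` names are introduced here for illustration, and `urlencode` percent-encodes the square brackets, which is the standard URL encoding of those characters.

```python
from urllib.parse import urlencode

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# each search parameter becomes one key-value pair
params = {
    "db": "pubmed",                    # database to search
    "retmax": 20,                      # maximum number of IDs to return
    "term": "stroke AND 2020[pdat]",   # search term plus publication-date filter
}

# urlencode joins the pairs with "&" and escapes spaces and brackets
query = urlencode(params)
url = f"{base_url}?{query}"
print(url)
```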

In [3]:
r = requests.get(url)

The function requests.get() returns the server's response to our request.
Let's have a look at what this response is:

In [4]:
print(r)

<Response [200]>



Printing r directly to the console gives us a terse description of a response object with code 200. This is an HTTP response status code, which tells us whether or not our request was successful. The code 200 means "OK", which is a good sign that our request went through. You can find what other codes mean [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), just in case you get something else.

Now that we have a response object, we can start extracting information. For example, we can get the status code from the object's properties:

In [6]:
r.status_code

200

We can also get the content of the response using the following:

In [7]:
content = r.text
print(content)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>30647</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>35615236</Id>
<Id>35353466</Id>
<Id>35342805</Id>
<Id>35279226</Id>
<Id>35130113</Id>
<Id>35074187</Id>
<Id>35002168</Id>
<Id>34950386</Id>
<Id>34950327</Id>
<Id>34898801</Id>
<Id>34888207</Id>
<Id>34795974</Id>
<Id>34765407</Id>
<Id>34752535</Id>
<Id>34732927</Id>
<Id>34728943</Id>
<Id>34720138</Id>
<Id>34713244</Id>
<Id>34713060</Id>
<Id>34695217</Id>
</IdList><TranslationSet><Translation>     <From>stroke</From>     <To>"stroke"[MeSH Terms] OR "stroke"[All Fields]</To>    </Translation></TranslationSet><TranslationStack>   <TermSet>    <Term>"stroke"[MeSH Terms]</Term>    <Field>MeSH Terms</Field>    <Count>159896</Count>    <Explode>Y</Explode>   </TermSet>   <TermSet>    <Term>"stroke"[All Fields]</Term>    <Field>All Field

The printout above clearly indicates that our data is formatted as an XML string. We can use the xml.etree.ElementTree module to parse this XML string into an element tree and extract our desired PubMed IDs (see Cheat Sheet 2 if you need a refresher on parsing XML):

In [8]:
tree = ET.ElementTree(ET.fromstring(content))
root = tree.getroot()
ids = [Id.text for Id in root.iter('Id')]

In [9]:
print(ids)

['35615236', '35353466', '35342805', '35279226', '35130113', '35074187', '35002168', '34950386', '34950327', '34898801', '34888207', '34795974', '34765407', '34752535', '34732927', '34728943', '34720138', '34713244', '34713060', '34695217']


You may find it useful to adapt the code above into a function that returns some PubMed IDs based on search parameters. This will help in case you have to search for several different topics.
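One possible shape for such a function is sketched below. The names `search_pubmed_ids`, `parse_ids`, and the optional `fetch` argument are assumptions made for this example; `fetch` exists only so the function can be tried out with canned XML instead of a live request.

```python
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def parse_ids(xml_text):
    """Pull the text of every <Id> element out of an esearch response."""
    root = ET.fromstring(xml_text)
    return [id_node.text for id_node in root.iter("Id")]

def search_pubmed_ids(search_term, year, retmax=20, fetch=None):
    """Return a list of PubMed IDs for a search term and publication year."""
    if fetch is None:
        # default: perform a real HTTP request with the requests module
        import requests
        fetch = lambda url: requests.get(url).text
    parameters = f"?db=pubmed&retmax={retmax}&term={search_term}+AND+{year}[pdat]"
    return parse_ids(fetch(ESEARCH_URL + parameters))
```

Calling `search_pubmed_ids("stroke", 2020)` would then repeat the search from the cells above.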

Now that we have our paper IDs, we will need to make another request to **a different endpoint** to get some metadata back. We can ask for the metadata of multiple papers at once by setting the id search parameter to a collection of PubMed IDs separated by commas.

**WARNING: The Entrez API only allows a limited length of URL, so you may have to request papers in small batches. If this is the case, you absolutely must space out the requests you send to the server using the time module, or else they will revoke your IP address's access to data. Below, I'm using the time module to spread out multiple runs through the same loop:**

In [10]:
for i in range(10):
    time.sleep(1)
    print(i)

0
1
2
3
4
5
6
7
8
9


In [11]:
id_string = ",".join(ids)  # joins the list of ids into a comma-separated string

base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
parameters = f"?db=pubmed&retmode=xml&id={id_string}"

url = base_url + parameters
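If the list of IDs is too long for one URL, the batching and pausing described in the warning could look roughly like the sketch below. The helper names `chunk` and `fetch_in_batches`, the batch size, and the one-second delay are all assumptions for illustration; check the Entrez usage guidelines for the actual limits.

```python
import time

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_in_batches(ids, fetch, batch_size=10, delay=1.0):
    """Request metadata for `ids` a few at a time, pausing between requests."""
    efetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    responses = []
    for batch in chunk(ids, batch_size):
        id_string = ",".join(batch)
        url = f"{efetch_url}?db=pubmed&retmode=xml&id={id_string}"
        responses.append(fetch(url))
        time.sleep(delay)  # space out the requests so we don't hammer the server
    return responses
```

With the requests module, this could be called as `fetch_in_batches(ids, lambda url: requests.get(url).text)`.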

That warning aside, let's get some data from the efetch endpoint:

In [12]:
r = requests.get(url)

In [13]:
print(r.text[0:1000]) #only prints out the first thousand characters of our metadata

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet><PubmedArticle><MedlineCitation Status="PubMed-not-MEDLINE" Owner="NLM"><PMID Version="1">35615236</PMID><DateRevised><Year>2022</Year><Month>05</Month><Day>26</Day></DateRevised><Article PubModel="Print-Electronic"><Journal><ISSN IssnType="Print">1751-1437</ISSN><JournalIssue CitedMedium="Print"><Volume>23</Volume><Issue>2</Issue><PubDate><Year>2022</Year><Month>May</Month></PubDate></JournalIssue><Title>Journal of the Intensive Care Society</Title><ISOAbbreviation>J Intensive Care Soc</ISOAbbreviation></Journal><ArticleTitle>Cerebral oximetry in adult cardiac surgery to reduce the incidence of neurological impairment and hospital length-of-stay: A prospective, randomized, controlled trial.</ArticleTitle><Pagination><MedlinePgn>109-116</MedlinePgn></Pagination><ELocationID EIdType="doi" ValidYN="Y">10.

Again, our response text is just a string containing XML-formatted metadata. Since we have already gone over XML parsing, I will leave it to you to figure out the internal structure of the document and extract some information using the ElementTree module.

**Hint**: when parsing very large, complex, deeply nested XML responses such as this one, you might find it useful to write your XML data to a file and look at it in an editor with syntax highlighting to get a broad overview of its structure. This is especially useful when you're trying to write code that pulls specific information out of the XML.
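To give a feel for the approach without spoiling the exercise, here is a minimal sketch that runs on a hand-trimmed sample instead of the real response. The sample only mimics the nesting visible in the printout above, and its title is a placeholder; the real documents contain many more fields.

```python
import xml.etree.ElementTree as ET

# a tiny, hand-written stand-in for an efetch response
sample_xml = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">35615236</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

root = ET.fromstring(sample_xml)
records = {}
for article in root.iter("PubmedArticle"):
    # findtext follows a path of child tags and returns the text it finds
    pmid = article.findtext("MedlineCitation/PMID")
    title = article.findtext("MedlineCitation/Article/ArticleTitle")
    records[pmid] = {"title": title}

print(records)
```

Real titles and abstracts sometimes contain inline markup, in which case `findtext` may return only part of the text and `itertext()` on the element may be the safer choice.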

---

### JavaScript Object Notation (JSON)

Like XML, JSON is a data format that is commonly used to pass around and store structured information. Like XML, JSON is flexible and capable of storing complex, nested information.

Here's an example of JSON:

```json
{
    "name": "Voldemort",
    "evil?": true,
    "birth year": 1926,
    "henchman": [
                     "Lucius Malfoy", 
                     "Severus Snape", 
                     "Belatrix Lestrange",
                     "Peter Pettigrew"
                ],
    "facial features":  {
                            "nose": null,
                            "skin": "ghastly"
                        }  
}
```

JSON has two basic structural components that allow us to organize this information: objects and arrays.

**Objects** are very similar in structure to Python dictionaries. Like dictionaries, objects store information using key-value pairs. While dictionary keys can be any hashable type, objects can only use strings as keys. Object values, on the other hand, can be strings, numbers, booleans, nulls, arrays, and even other objects.

The syntax of an object is quite simple, and is described by four rules:
1. Objects are delimited by curly brackets **{  }**.

2. An individual key-value pair is stored as **"key": value**, with a colon separating the key and value.

3. Key-value pairs are separated from each other by commas.

4. Whitespace and new lines do not affect the structure, so we can use them as we see fit for readability.

Putting these rules together, here's what a basic object looks like:

```json
{
    "key1": "value1",
    "key2": 2,
    "key3": true,
    "key4": null
}
```

---

**Arrays** are very similar in structure to python lists. Like lists, they store an ordered collection of data across a range of integer indices. We often find that all the members of a given array are of the same type, but this is not always the case. Arrays can store strings, numbers, booleans, nulls, objects, and other arrays.

Array syntax is also very simple:
1. Arrays are delimited by square brackets **[   ]**

2. Elements inside an array are separated by commas

3. Whitespace and new lines do not affect the structure, so we can use them as we see fit for readability

Putting these rules together, here's a basic JSON array:

```json
    [
        1,
        2,
        3,
        4,
        5
    ]
```

A few more notes on JSON syntax:
1. Short objects and arrays can be written inline like so:
    - ```json 
        {"name": "tom", "age": 24}
      ```
    - ```json 
        [1, 2, 3]
      ```
2. For readability, larger objects and arrays should use line separation and tabbing:
    - ```json 
        {
            "name": "tom", 
            "age": 24,
            "hobbies": ["reading",
                        "long walks on the beach",
                        "cooking",
                        "pickleball"]
         }
      ```
3. JSON strings must be delimited with double quotes, whereas python strings can use either double or single quotes

---

Now that we understand what JSON is, how do we use it in Python? We'll start by importing Python's built-in json module:

In [14]:
import json

The json module automatically converts JSON structures to their Python equivalents and vice versa. Below is a conversion table that describes these data type equivalences:

|JSON | Python|
|-----|-------|
|object <br> {"hello": "world"}|dictionary <br> {'hello': "world"}|
|array <br> [1,2,3] |list <br> [1,2,3]|
|string<br>"mystring" | string <br><ul><li>"mystring"</li><li>'mystring'</li></ul>|
|number<br> 5| int <br> 5|
|number<br> 3.14 | float <br> 3.14|
|Boolean<br><ul><li>true</li><li>false</li></ul> | bool <br><ul><li>True</li><li>False</li></ul>|
|null| None|
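The table can be checked directly with a short experiment. The JSON string below is made up for this example and contains one value of each kind:

```python
import json

# one JSON value of each kind: array, integer, float, string, boolean, null, object
parsed = json.loads('{"a": [1, 2.5, "x"], "b": true, "c": null, "d": {"e": false}}')

# print each key with the Python type its value was converted to
for key, value in parsed.items():
    print(key, type(value).__name__, value)

# the elements of the array keep their own types
for element in parsed["a"]:
    print(type(element).__name__, element)
```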

JSON is often stored and passed as plain text. Here we will use the json module to parse the plain text JSON object at the top of the section:

In [15]:
# triple-quoted strings like this preserve the indenting and line breaks inside a string
json_text = '''{
    "name": "Voldemort",
    "evil?": true,
    "birth year": 1926,
    "henchman": [
                     "Lucius Malfoy", 
                     "Severus Snape", 
                     "Belatrix Lestrange",
                     "Peter Pettigrew"
                ],
    "facial features":  {
                            "nose": null,
                            "skin": "ghastly"
                        }  
}
'''


json_dict = json.loads(json_text)
pp(json_dict)

{'birth year': 1926,
 'evil?': True,
 'facial features': {'nose': None, 'skin': 'ghastly'},
 'henchman': ['Lucius Malfoy',
              'Severus Snape',
              'Belatrix Lestrange',
              'Peter Pettigrew'],
 'name': 'Voldemort'}


As we can see, the json.loads() function will parse a string containing JSON into a nested structure of python dictionaries and lists. To go in the other direction, we use the json.dumps() function. This direction requires an extra choice from the programmer: how do we want to format the text? There's not necessarily one right answer, just pick formatting parameters that make sense for you or your team.

In [16]:
new_text = json.dumps(json_dict, indent=4, sort_keys=True)
print(new_text)

{
    "birth year": 1926,
    "evil?": true,
    "facial features": {
        "nose": null,
        "skin": "ghastly"
    },
    "henchman": [
        "Lucius Malfoy",
        "Severus Snape",
        "Belatrix Lestrange",
        "Peter Pettigrew"
    ],
    "name": "Voldemort"
}


Once you have collected, processed, and formatted large quantities of data from an API, you should probably save that data for later use. This way, you don't have to bother the API for the same data more than once (very important), and you don't have to process it again. As an added bonus, this makes our data pipeline more **modular**, which helps us break the problem up into separate steps. We can write JSON to a file by using the json.dump() function:

In [17]:
with open("./voldemort.json", "w") as json_file:
    json.dump(json_dict, json_file)

Check to see that "voldemort.json" has been added to the folder containing this file.

When we're ready to come back and use our data, we can read the file using the json.load() function:

In [18]:
with open("./voldemort.json", "r") as json_file:
    dict_from_file = json.load(json_file)
    pp(dict_from_file)

{'birth year': 1926,
 'evil?': True,
 'facial features': {'nose': None, 'skin': 'ghastly'},
 'henchman': ['Lucius Malfoy',
              'Severus Snape',
              'Belatrix Lestrange',
              'Peter Pettigrew'],
 'name': 'Voldemort'}


WARNING: it is easy to confuse the purpose of json.load() versus json.loads() and json.dump() versus json.dumps(). I find it helpful to pretend that the "S" in "loads" and "dumps" stands for "String", so these functions convert directly between **string** objects and dictionaries. By process of elimination, we can conclude that load and dump do not convert directly to strings, so they must interact with files instead.
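The four functions can be compared side by side in a short sketch. To avoid touching the disk, it uses `io.StringIO` as an in-memory stand-in for an open file; the `record` dictionary is made up for the example.

```python
import io
import json

record = {"name": "Voldemort", "evil?": True}

# the "s" variants work with strings
text = json.dumps(record)        # dict -> str
same_record = json.loads(text)   # str -> dict

# the variants without "s" work with file objects
buffer = io.StringIO()           # behaves like a file opened for writing
json.dump(record, buffer)        # dict -> file-like object
buffer.seek(0)                   # rewind before reading
from_file = json.load(buffer)    # file-like object -> dict

print(same_record == from_file)
```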

## Exercise 2

By now, we have gone over most of the tools you need to complete this problem on your own. If you get stuck, refer back to the section on basic Python and pandas in Cheat Sheet 1.

---

## Exercise 3

Now we are going to get into the nitty gritty of machine learning. As you may know, machine learning is broadly used to describe algorithms that iteratively improve some model using data. As the model consumes more data, it tends to become more accurate, so we say that the machine "learns". This technique can take many different forms and create models for virtually any imaginable purpose. Roughly speaking, there are two coarse categories that we can separate machine learning algorithms into: supervised and unsupervised. 

**A supervised learning algorithm** learns by training on labeled data, which consists of many input-output pairs. At each iteration, the model takes an input and makes a prediction of what the output should be. The learning algorithm then compares this predicted output with the actual output that was paired with the input in the labeled data set. Based on this comparison, the learning algorithm adjusts the model's internal parameters slightly in whatever direction would best move the prediction closer to the actual output. By iterating over this process many times, the model gradually makes more and more accurate predictions. Thus supervised learning tends to be used to generate predictive models.
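The predict, compare, adjust loop described above can be sketched in a few lines of plain Python. This toy example fits a straight line to made-up, noise-free labeled data using stochastic gradient descent; the data, learning rate, and number of passes are all arbitrary choices for illustration.

```python
# labeled data: each input x is paired with the output y = 2x + 1
data = [(x, 2 * x + 1) for x in range(-5, 6)]

w, b = 0.0, 0.0        # model parameters, starting from a poor guess
learning_rate = 0.01

for epoch in range(2000):
    for x, y_true in data:
        y_pred = w * x + b        # the model makes a prediction
        error = y_pred - y_true   # compare it with the labeled output
        # nudge each parameter in the direction that shrinks the squared error
        w -= learning_rate * error * x
        b -= learning_rate * error

print(round(w, 3), round(b, 3))
```

After training, `w` and `b` should sit very close to the values 2 and 1 that generated the data.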

**An unsupervised learning algorithm** learns by training on unlabeled data. Generally, these models attempt to determine some salient property of the dataset. The type of property we want to extract, and the algorithm we choose to extract it, will largely depend on our use case. For example, if we want to discover naturally occurring clusters in our data, we might use the k-means algorithm. If we want to simplify a dataset with many variables, we might instead choose truncated singular value decomposition (SVD) to pick out features that best explain variability.

### Transformers

---

**Transformers** are a special type of artificial neural network that process sequential data such as text, images, or video. Like all neural networks, transformers are made up of layers of **neurons**, which are linked up by weighted connections. Also like all neural networks, transformers are primarily trained using a supervised learning algorithm called **backpropagation**, which iteratively tunes the weights of each connection starting from the output layer and working backwards to the input layer. In particular, transformers have an internal mechanism called self-attention that allows them to decide which parts of the data are most important, and give those parts of the data more weight in determining the output.

Transformers are often used to embed human-readable text into high-dimensional vectors, which, remarkably, encode semantic information in geometric and algebraic relationships. For example, in the embedding given by word2vec (an earlier embedding model that is not itself a transformer), we might find that "King" - "Man" + "Woman" = "Queen", or that "Actor" - "Talent" + "Ego" = "Jay Leto".

![](embedding.png)
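The vector arithmetic can be imitated with a toy example. The 3-dimensional vectors below are invented by hand purely for illustration and do not come from any real model; real embeddings have hundreds of dimensions and the relationship only holds approximately.

```python
import math

# made-up vectors, hand-picked so that the analogy works out
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity: close to 1 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed coordinate by coordinate
target = [k - m + w for k, m, w in
          zip(toy_vectors["king"], toy_vectors["man"], toy_vectors["woman"])]

# look for the closest remaining word, excluding the three we started from
candidates = {word: vec for word, vec in toy_vectors.items()
              if word not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(nearest)
```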

In natural language processing, we will often use a transformer to map text to vectors as the first step in a larger model-training pipeline. In this exercise, you are tasked with using a pretrained transformer, SPECTER, to embed metadata text before using that embedding to train two other models. Let's try this on a dataset similar to the ones you generated in exercise 2.

In [19]:
with open('stroke.json') as file:
    stroke_dict = json.load(file)

Now that we have some data, let's download and import the tools we will need to parse and embed our text data:

In [20]:
#download transformers using conda in terminal
!conda install -c huggingface transformers -y

Collecting package metadata (current_repodata.json): done

# All requested packages already installed.



In [24]:
#import tools
from transformers import AutoTokenizer, AutoModel
import math

# load tokenizer, which parses texts into tokens, which can be fed into the model
tokenizer = AutoTokenizer.from_pretrained('allenai/specter')

#load model, which is used to embed our text data as vectors
model = AutoModel.from_pretrained('allenai/specter')

Now that we have our tools imported, let's use them to create our embeddings. The result of the following cell, embeddings, is a numpy array where the i'th element of the array is a 768-dimensional vector that embeds the i'th paper:

In [29]:
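import numpy as np
import torch

papers = list(stroke_dict.values())
batch_size = 100
# round up so that the final, partially filled batch is not dropped
batch_num = math.ceil(len(papers) / batch_size)

batch_embeddings = []
for i in range(batch_num):
    batch = papers[batch_size*i : batch_size*(i+1)]
    # the tokenizer expects a flat list of strings, so we embed one batch at a time
    texts = [paper["title"] + tokenizer.sep_token + paper["abstract"] for paper in batch]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512)
    with torch.no_grad():  # inference only, so no gradients are needed
        result = model(**inputs)
    # take the first token of each text as that paper's embedding
    batch_embeddings.append(result.last_hidden_state[:, 0, :].detach().numpy())

# stack the per-batch arrays into one array with one row per paper
embeddings = np.concatenate(batch_embeddings)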

## PCA and LDA:

Both Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are linear methods used for dimensionality reduction, but they differ in their objective. PCA seeks to build a low-dimensional coordinate system using whatever small collection of components (i.e. combinations of variables) best explains the variability of the underlying dataset, and then simply throws out the coordinates that do not have a high impact on variation. LDA, on the other hand, seeks to project the dataset onto a lower-dimensional space in a way that best preserves the separation of distinct data classes.

![PCA vs LDA](PCA_vs_LDA.png)

[image source](https://medium.com/analytics-vidhya/pca-vs-lda-vs-t-sne-lets-understand-the-difference-between-them-22fa6b9be9d0)
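For readers who like to see the two objectives written out, the standard textbook formulations for a single projection direction are sketched below. The symbols are introduced here and are not used elsewhere in this sheet: $S$ is the covariance matrix of the data, $S_B$ and $S_W$ are the between-class and within-class scatter matrices, and $w$ is the projection direction.

$$
w_{\text{PCA}} = \arg\max_{\|w\| = 1} \; w^\top S \, w
\qquad\qquad
w_{\text{LDA}} = \arg\max_{w} \; \frac{w^\top S_B \, w}{w^\top S_W \, w}
$$

PCA never looks at class labels, while LDA needs them to build $S_B$ and $S_W$.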

### Principal Component Analysis

In [None]:
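import pandas as pd
from sklearn import decomposition

# project the 768-dimensional embeddings onto their first three principal components
pca = decomposition.PCA(n_components=3)
embeddings_pca = pd.DataFrame(
    pca.fit_transform(embeddings),
    columns=['PC0', 'PC1', 'PC2']
)
# papers is a list of dictionaries, so we iterate over it directly
embeddings_pca["query"] = [paper["query"] for paper in papers]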

---

### Linear Discriminant Analysis

---

In [None]:
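from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# class labels for LDA: the query that each paper came from
categories = [paper["query"] for paper in papers]

# n_components can be at most (number of classes - 1)
lda = LinearDiscriminantAnalysis(n_components=1)
embeddings_lda = pd.DataFrame(
    lda.fit_transform(embeddings, categories), columns=["lda0"]
)
embeddings_lda["query"] = categories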

---

## Exercise 4

Here you are tasked with describing how to parallelize the merge sort algorithm using two processes. Before you try to parallelize it, let's try to get a good understanding of how merge sort works in general. Suppose you would like to sort the following list of values into ascending order:

```json
[5, 7, 6, 1]
```

We start by splitting the list into two equal halves, effectively by making a copy of each half:

```json
[5, 7, 6, 1]
```
$\hspace{3.5em}\swarrow\searrow$
```json
[[5,7],[6,1]]
```

We continue to recursively break up each sub-list one at a time until we encounter two lists containing at most one element each. Breaking up the first sublist above, we arrive at the following:

```json
[5, 7]
```

$\hspace{2em}\swarrow\searrow$

```json
[[5],[7]]
```

Now that we have two sublists of size 1, it is time to merge. We accomplish this by comparing the first element from each sublist. Whichever is smaller is removed from its split sublist and appended to the end of the merged list.

```json
{
    split: [[5],[7]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 5 is removed from split and added to merged

```json
{
    split: [[],[7]],
    merged: [5]
}
```

Now we try to compare the first element of each of the split sublists again. Since the first sublist has run out of elements, there is nothing left to compare, so the rest of the second sublist goes into the merged list. The smallest element of the second sublist has already been compared with every element now in the merged list, so we can conclude that the second sublist can simply be added to the end of the merged list:

```json
{
    split: [[],[7]],
    merged: [5]
}
```

$\hspace{4em}\Downarrow$ remaining elements of our second sublist are appended to the merged list

```json
{
    split: [[],[]],
    merged: [5,7]
}
```

Now that the first pair of single-element sublists are merged, we rinse and repeat with the next sublist:

### Split
```json
[6, 1]
```

$\hspace{2em}\swarrow\searrow$

```json
[[6],[1]]
```

### Merge

```json
{
    split: [[6],[1]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 1 is removed from split and added to merged

```json
{
    split: [[6],[]],
    merged: [1]
}
```

$\hspace{4em}\Downarrow$ remaining elements of our first sublist are appended to the merged list

```json
{
    split: [[],[]],
    merged: [1,6]
}
```


Now all of the single element sublists have been merged into sorted lists. We start merging these larger lists, which is only slightly more complicated:

```json
{
    split: [[5,7],[1,6]],
    merged: []
}
```

We compare the first element of each split sublist. Whichever is smaller goes in the merged sublist first:

```json
{
    split: [[5,7],[1,6]],
    merged: []
}
```

$\hspace{4em}\Downarrow$ 1 is removed from split and added to merged

```json
{
    split: [[5,7],[6]],
    merged: [1]
}
```

Now the second split sublist has a new smallest element, so we have to make the comparison again.

```json
{
    split: [[5,7],[6]],
    merged: [1]
}
```

$\hspace{4em}\Downarrow$ 5 is removed from split and added to merged

```json
{
    split: [[7],[6]],
    merged: [1,5]
}
```

Repeating this step one more time, we get:

```json
{
    split: [[7],[6]],
    merged: [1,5]
}
```
$\hspace{4em}\Downarrow$ 6 is removed from split and added to merged

```json
{
    split: [[7],[]],
    merged: [1,5,6]
}
```

Again, one of our split sublists has run out of elements, so we have to merge the rest of the other sublist into the merged list. Since the smallest element of this split sublist has already been compared with the largest element of the merged sublist, every element of the split sublist is larger than every element of the merged sublist. Hence, we just append all the elements from the split sublist to the end of the merged sublist:

```json
{
    split: [[7],[]],
    merged: [1,5,6]
}
```

$\hspace{4em}\Downarrow$ remaining elements of split sublist appended to the end of merged

```json
{
    split: [[],[]],
    merged: [1,5,6,7]
}
```

And so our list is sorted, and our task is done.
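The walkthrough above can be condensed into a short, sequential Python sketch. The function names are chosen for this example, and instead of physically removing elements from the split sublists, the merge step tracks the current front of each sublist with the indices `i` and `j`. How to split this work across two processes is left for the exercise.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0  # positions of the current "first element" in each sublist
    while i < len(left) and j < len(right):
        # whichever front element is smaller goes into the merged list first
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # one sublist has run out, so the rest of the other is appended as-is
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def merge_sort(values):
    """Recursively split the list in half, sort each half, then merge."""
    if len(values) <= 1:
        return values
    middle = len(values) // 2
    return merge(merge_sort(values[:middle]), merge_sort(values[middle:]))

print(merge_sort([5, 7, 6, 1]))
```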

### Disclaimer:
The explanation above is for illustrative purposes only. Making many copies of lists and resizing lists both use lots of memory and can slow down your program as your dataset scales. It is more efficient to define a sublist by specifying its beginning and ending indices, and to "remove" an element from a sublist by changing those index bounds.

---

## Exercise 5

You should already have most of the tools you need to answer this question. However, we shall briefly review missing data, as well as some basic strategies for dealing with it.

### There are three main forms that missing data can take ordered from best case to worst case scenario:

 1. Missing Completely at Random (MCAR): there is no reason for missing data inherent to the data itself. In other words, all entries are equally likely to be missing.
 2. Missing at Random (MAR): whether or not an entry is missing correlates with one or more known variables.
 3. Missing Not at Random (MNAR): whether or not an entry is missing correlates with some unknown variable.

### Quick and Dirty MAR/MNAR Test


To detect MAR/MNAR data, choose a field with some missing entries. According to this field, we split our dataset into two groups:

1. A group of all rows missing that field's entry
2. A group of all rows not missing that field's entry

Check to see if the summary statistics of each group are similar. If those summary statistics differ by a significant amount, it's a sign that our data is either MAR or MNAR. If the summary statistics of the two groups are very similar, it doesn't prove anything, but it at least suggests that your data might be MCAR.

**Note**: In practice, it is often impossible to tell the difference between MNAR and MAR since the data you would need to determine this difference is precisely the data that is missing.

Remember that you may have multiple fields with missing data, so it's a good idea to run the above process for every field with missing data.
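The quick test can be sketched in plain Python. The rows below are hypothetical, with `None` marking a missing `income` entry; in practice you would likely do the same thing with pandas and compare more than one summary statistic.

```python
from statistics import mean

# hypothetical survey rows; None marks a missing income entry
rows = [
    {"age": 25, "income": 40000},
    {"age": 28, "income": 45000},
    {"age": 30, "income": 52000},
    {"age": 33, "income": 50000},
    {"age": 61, "income": None},
    {"age": 64, "income": None},
    {"age": 67, "income": None},
]

# split the dataset by whether the chosen field is missing
missing = [row for row in rows if row["income"] is None]
present = [row for row in rows if row["income"] is not None]

# compare a summary statistic of another field across the two groups
mean_age_missing = mean(row["age"] for row in missing)
mean_age_present = mean(row["age"] for row in present)
print(mean_age_missing, mean_age_present)
```

In this made-up data the two group means are far apart, which would hint that income is not missing completely at random.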

### Little's Test for MCAR

Little's test is more rigorous than the quick test above, but more involved. The internal workings of this test are beyond the scope of this class, but the short explanation is that the technique defines a test statistic that approximately follows a $\chi^2$ (chi-squared) distribution for large datasets when the data is MCAR. There are packages available in R and SPSS to run this test.

----

### Dealing with Missing Data

There are four basic strategies we can use to deal with missing data:
1. **Ignore it:** pretend that the problem isn't real and can't hurt you.
2. **Drop rows:** drop any row that is missing entries in a field that we care about. 
3. **Drop the field:** drop any field that is missing too many entries to salvage.
4. **Impute missing values:** make an educated guess at what the missing value **could** be based on the data you have.

### Ignoring Missing Data

Obviously this is risky, and often inadvisable. There is little more to say here.

### Dropping Rows

#### Listwise vs Pairwise Deletion
There are two ways we can drop rows:

1. **Listwise deletion:** Before any analyses begin, delete all the rows with values missing in any of the fields we will use for *any* of our analyses.
2. **Pairwise deletion:** Before an individual analysis, delete only the rows with entries missing in the fields we will use for *that particular* analysis.

Listwise deletion is a bit easier, but often results in us deleting more data than we have to.
Pairwise deletion is a bit harder, but usually preserves more data for each individual analysis.

For example, suppose we wanted to run two separate linear regressions: one regression between variables A & B, and one regression between A & C. If we used listwise deletion, we would start by deleting all the rows missing any entries in fields A, B, or C. Only then would we run our regressions. Notice that this means our A-B regression might then lose out on some rows that are only missing the C entry, which is totally irrelevant to that analysis.

Now suppose we wanted to run the same regressions, but chose to use pairwise deletion instead. We would start by removing all the rows with missing data in the A or B fields from a copy of our original dataset, then run our A-B regression on that filtered dataset. Then we would remove all the rows with missing data in the A or C fields from another copy of our original dataset, and run the A-C regression on the new filtered dataset. By doing so, our A-B regression won't lose out on any rows due to missing values in the C column.
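The difference in how many rows survive can be seen in a small sketch. The four rows and the helper `complete` are made up for this example, with `None` marking a missing entry:

```python
# hypothetical dataset with three fields; None marks a missing entry
rows = [
    {"A": 1.0,  "B": 2.0,  "C": 3.0},
    {"A": 2.0,  "B": None, "C": 1.0},
    {"A": 3.0,  "B": 5.0,  "C": None},
    {"A": None, "B": 1.0,  "C": 4.0},
]

def complete(rows, fields):
    """Keep only the rows that have a value in every one of the given fields."""
    return [row for row in rows if all(row[f] is not None for f in fields)]

# listwise deletion: filter once, using every field needed by any analysis
listwise = complete(rows, ["A", "B", "C"])

# pairwise deletion: filter separately for each analysis
ab_pairwise = complete(rows, ["A", "B"])
ac_pairwise = complete(rows, ["A", "C"])

print(len(listwise), len(ab_pairwise), len(ac_pairwise))
```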

#### Risks of Dropping Rows

Since dropping rows decreases the size of the sample you can use for analyses, this process decreases the statistical power of your models. This is doubly relevant for listwise deletion, since that method usually results in more dropped rows.

This method will also bias your data if it is not MCAR, since some properties of your data will correlate with missing entries.

### Dropping Fields

If a given field has a high proportion of missing data, analyses based on that field may not be very accurate, and are unlikely to generalize well. If this is the case, you should probably drop that field entirely.

This will not bias your data like dropping rows can, but having fewer variables may reduce the power of your model.

### Imputation

Imputation is a technique where we predict what a missing value should be based on the rest of the dataset. Imputation has a distinct advantage over dropping strategies: it yields more data. This can improve the power of our analyses **if we choose our imputed values wisely**. Choosing our imputed values unwisely can result in totally worthless models, so be careful when you're cleaning your data this way.

#### Bad Imputation Strategies (do not attempt at home):


Here are a few tips if you would like to stay in the good graces of your graders and employers:

1. **DO NOT** use the mean, median, mode, etc. of a field to fill that field's missing values
    - e.g. $\textit{missing height} = \mathbb{E}[\textit{height}] $
2. **DO NOT** use linear regression predictions to replace missing values
    - e.g. $\textit{missing value} = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n$
3. **DO NOT** use stochastic regression predictions either
    - e.g. $\textit{missing value} = \beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n + \varepsilon_i$, where $\varepsilon_i \sim N(0,s)$.

#### Deterministic Imputation (attempt at home)

If there is some logical reason why some values in a given row imply a missing entry can only have one value, then this is our best option.
For example, if you already know a company's income and expenses for the quarter, you can fill its missing profit entry for that quarter with income - expenses. Or if you know that a patient is younger than 65, then by definition, that patient cannot have late-onset dementia. 

The only risk here is that your logical reasoning is incorrect. As long as the rules you use to choose your imputed values are sound, then this method can do no wrong.
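The profit example can be written as a short sketch. The numbers are invented for illustration, and `None` marks a missing entry:

```python
# hypothetical quarterly records; None marks a missing profit entry
quarters = [
    {"income": 120, "expenses": 80, "profit": 40},
    {"income": 150, "expenses": 90, "profit": None},
]

for row in quarters:
    # only fill the gap when both values the rule depends on are known
    if (row["profit"] is None
            and row["income"] is not None
            and row["expenses"] is not None):
        row["profit"] = row["income"] - row["expenses"]

print(quarters)
```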

#### Multiple Imputation (attempt at home)

Multiple imputation is simply the process of making many predictions for a missing value and then combining those predictions to get some final imputed value. This technique can vary in the way we choose to make our predictions, as well as the way we choose to combine them. One valid method is to use the non-missing portions of our data to construct conditional distributions for each of the missing values given other values in that row. Drawing from these conditional distributions many times, using those drawn values to construct predictors, and then combining those predictors with averaging or voting will yield our final predictor. Two popular methods for using values drawn from conditional distributions to construct predictors are the Full Information Maximum Likelihood (FIML) and Expectation Maximization (EM) methods.