# Assignment 2 Cheat Sheet

## Problem 1
### Parsing XML

Our first problem will require us to parse lots of data from an XML file. Let's start by importing a tool from the xml module and loading some dummy data so we can practice parsing XML.

In [28]:
import xml.etree.ElementTree as ET
from pprint import pprint as pp 
tree = ET.parse('./books.xml')
root = tree.getroot()

The first line in the cell above just imports the ElementTree module and saves it under the shortened alias, "ET". 

The second line imports the pprint ("pretty print") function from the pprint module under the alias pp. It simply prints things in a more readable format than Python's built-in print function.

The third line looks for a file called "books.xml" in the current working directory, which is normally the folder where this cheatsheet2.ipynb file is stored. If it finds the file, the ET module parses it into a tree of element objects, much like a document object model (DOM), that mirrors the structure of the XML file inside Python. If the file is missing, the call raises a FileNotFoundError.

To navigate a DOM tree, it is often easiest to start from the tree's root element and move step by step to the current element's children until we find what we're looking for. Our fourth line selects the root element of the DOM tree for this purpose. Let's poke around until we understand the basics of navigating our DOM tree. It helps to have the XML file open on the side as a roadmap.

In [29]:
pp(root)

<Element 'catalog' at 0x7ff6b83e7590>


As evidenced by the above cell, the root of our DOM tree corresponds to the catalog element. Referring to the XML file, we should expect it to have several child book elements, each with a unique id attribute. Each of those book elements has its own children holding information about that specific book. We can iterate over all the children of our root node (i.e. each book element) using a for loop:

In [30]:
for child in root:
    pp(child)

<Element 'book' at 0x7ff6b8366e50>
<Element 'book' at 0x7ff6b836a180>
<Element 'book' at 0x7ff6b836a400>
<Element 'book' at 0x7ff6b836a630>
<Element 'book' at 0x7ff6b836a8b0>
<Element 'book' at 0x7ff6b836ab80>
<Element 'book' at 0x7ff6b836ae00>
<Element 'book' at 0x7ff6b836e090>
<Element 'book' at 0x7ff6b836e2c0>
<Element 'book' at 0x7ff6b836e540>
<Element 'book' at 0x7ff6b836e770>
<Element 'book' at 0x7ff6b836ea40>
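
The `id` on each book is an XML attribute, so it is read with `get` (or through the `attrib` dictionary) rather than with `.text`. The sketch below runs on a tiny inline catalog; `sample_xml`, its ids, and its titles are made up for illustration and only stand in for `books.xml`:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature catalog standing in for books.xml
sample_xml = """
<catalog>
    <book id="b1"><title>First Title</title></book>
    <book id="b2"><title>Second Title</title></book>
</catalog>
"""
sample_root = ET.fromstring(sample_xml)

for book in sample_root:
    # get() returns None if the attribute is missing; attrib is a plain dict
    print(book.tag, book.get("id"), book.attrib)

# Collect every id into a list
ids = [book.get("id") for book in sample_root]
```

The same `get` call should work on the elements loaded from the real file, assuming its book elements carry an `id` attribute as described above.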


We can access an individual child of a given element in several different ways:

In [31]:
#Select the child by its index, similar to a python list
book1 = root[0]
pp(book1)

<Element 'book' at 0x7ff6b8366e50>


In [32]:
#Select the child by using its tag name
author1 = book1.find("author") 
pp(author1)

<Element 'author' at 0x7ff6b8366ea0>


Note that if book1 had multiple author tags inside it, the find method would only select the first matching tag amongst its children.
We can also pull the text out from the inside of an element:

In [33]:
text1 = author1.text
pp(text1)

'Gambardella, Matthew'
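
One thing to watch for: `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. A minimal sketch of two ways to guard against that, using a made-up book element that has no author:

```python
import xml.etree.ElementTree as ET

# Hypothetical book element with no <author> child
book = ET.fromstring("<book><title>Untitled Draft</title></book>")

missing = book.find("author")   # no match, so this is None
print(missing)

# findtext returns the matched element's text, or the default if nothing matches
author = book.findtext("author", default="unknown")
title = book.findtext("title", default="unknown")
print(author, title)
```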


We can directly select grandchildren, great-grandchildren, etc. of a given element by specifying the path of that element relative to our current element:

In [34]:
title1 = root.find('book/title')  # finds a title element inside a book element inside root
pp(title1.text)

"XML Developer's Guide"


Suppose we'd like to get all the matching children from a search, rather than just the first one. We simply use the findall method instead:

In [35]:
books = root.findall('book')
pp(books)

[<Element 'book' at 0x7ff6b8366e50>,
 <Element 'book' at 0x7ff6b836a180>,
 <Element 'book' at 0x7ff6b836a400>,
 <Element 'book' at 0x7ff6b836a630>,
 <Element 'book' at 0x7ff6b836a8b0>,
 <Element 'book' at 0x7ff6b836ab80>,
 <Element 'book' at 0x7ff6b836ae00>,
 <Element 'book' at 0x7ff6b836e090>,
 <Element 'book' at 0x7ff6b836e2c0>,
 <Element 'book' at 0x7ff6b836e540>,
 <Element 'book' at 0x7ff6b836e770>,
 <Element 'book' at 0x7ff6b836ea40>]


In just the same fashion, we can get all the descendants of an element with a given relationship to the current element using a path as our search query:

In [38]:
titles = root.findall('book/title')
for title in titles:
    pp(title.text)

"XML Developer's Guide"
'Midnight Rain'
'Maeve Ascendant'
"Oberon's Legacy"
'The Sundered Grail'
'Lover Birds'
'Splish Splash'
'Creepy Crawlies'
'Paradox Lost'
'Microsoft .NET: The Programming Bible'
'MSXML3: A Comprehensive Guide'
'Visual Studio 7: A Comprehensive Guide'


If our DOM is very large and complicated, it may not be practical to spell out the path to every element by hand. In such cases, it is useful to recursively search through all the subtrees under an element. This is handled automatically by the iter method, which looks for matches in the element itself and in all of its descendants (children, grandchildren, and so on):

In [39]:
for price in root.iter('price'):
    pp(price.text)

'44.95'
'5.95'
'5.95'
'5.95'
'5.95'
'4.95'
'4.95'
'4.95'
'6.95'
'36.95'
'36.95'
'49.95'
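
To see how `iter` differs from a path search, consider a catalog in which one book is nested one level deeper than the others. The structure below, including the `shelf` tag, is invented purely for this illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog where one book sits inside an extra <shelf> element
sample_xml = """
<catalog>
    <book><price>10.00</price></book>
    <shelf>
        <book><price>20.00</price></book>
    </shelf>
</catalog>
"""
sample_root = ET.fromstring(sample_xml)

# The path only matches prices of books that are direct children of the root
direct = [p.text for p in sample_root.findall("book/price")]

# iter walks the whole subtree, so the nested price is found too
nested = [p.text for p in sample_root.iter("price")]

print(direct)
print(nested)
```

Passing the path `.//price` to `findall` also searches at any depth, which can be handy when a list is wanted rather than an iterator.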


Now that we know how to extract data from an XML file, we can load this information into a pandas dataframe for analysis:

In [40]:
import pandas as pd

In [42]:
book_data = []

for book in root:
    book_dict = {
        'title': book.find('title').text,
        'author': book.find('author').text,
        'price': float(book.find('price').text)
    }
    book_data.append(book_dict)

book_df = pd.DataFrame(book_data)

In [43]:
book_df.head()

### Bisection (Binary) Search Review

You are also tasked with writing an algorithm that searches for a given value in a sorted list in $O(\log n)$ time by using bisection. Let us quickly review how this algorithm works to help you on your way. 

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |

Suppose we have the above ordered list of natural numbers, and we'd like to find the location of the value 13. We can start by specifying the outer left and right indices, $L = 0$ and $R = 10$. Now take the midpoint between those two indices, $M = \left[\frac{L + R}{2}\right]$. The square brackets indicate that you will have to round $M$ to a whole number if $L + R$ is odd; the walkthrough below rounds up.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers | L |   |   |   |   | M  |    |    |    |    | R  |

As shown above, we calculate that $M=5$. We now compare the value at this midpoint, $8$, against our search value, $13$. Clearly $8 < 13$. Since the list is sorted, we know that all the values to the left of our midpoint index are also smaller than $13$. Therefore, we exclude these values from our future search by moving our left index $L$ to our current midpoint, $M$. Notice that we have just cut our search space **in half**.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |    |    | M   |    | R  |

Repeating the process above, we set the new $M$ to $\left[\frac{L+R}{2}\right] = 8$, rounding as necessary. The value at index $8$ is $34$, which is greater than our search value $13$. Again, since the list is sorted, we know all values to the right of $M$ are greater than $13$, so we will exclude them from our future search by moving $R$ to our current $M$.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |    |  M  | R   |    |  |

Again, we repeat from above. We set the new $M$ to $\left[\frac{L+R}{2}\right] = 7$. The value at index $M=7$ is $21$, which is greater than $13$, so we exclude all values to the right of $M$ by moving $R$ to the current $M$.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |  M  |  R |    |    |  |

Finally, we find that $M = \left[\frac{L+R}{2}\right] = 6$. The value at index $6$ is $13$, which matches our search value, so our search is done, and our algorithm should output the index $6$.
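
The walkthrough above can be replayed in a few lines of Python. This is only a tracing sketch, not a finished solution: the name `trace_bisection` is made up, it rounds the midpoint up to match the tables, and it assumes the search value is present and not at either end of the list.

```python
def trace_bisection(values, target):
    """Yield (L, M, R) for each step until values[M] equals target.

    Assumes target is present in the sorted list and is not at either end;
    otherwise this loop never finishes.
    """
    left, right = 0, len(values) - 1
    while True:
        mid = (left + right + 1) // 2   # round half-integers up, as in the tables
        yield left, mid, right
        if values[mid] == target:
            return
        if values[mid] < target:
            left = mid                  # discard everything left of M
        else:
            right = mid                 # discard everything right of M

fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
steps = list(trace_bisection(fib, 13))
for left, mid, right in steps:
    print(f"L={left} M={mid} R={right} value={fib[mid]}")
```

Each printed row corresponds to one of the marker tables above.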

What do you do if the search value is nowhere to be found in the list? For example, if we had tried to find $15$ in the list above, the steps described so far would result in a never-ending loop: once $L=6$ and $R=7$, the midpoint rounds to $7$ and $R$ never moves again. We must therefore add a safeguard that terminates the algorithm once $L$ and $R$ are adjacent. For our purposes, it may be useful for the algorithm to output the final value of $\frac{L+R}{2}$ (which is not an integer when $L$ and $R$ are adjacent) to indicate that our search was unsuccessful, as well as to divide the list into values below and above our search value.
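
One possible way to add that safeguard is sketched below. The function name `bisection_search` is hypothetical, and so are two choices the text does not specify: the endpoints are checked up front (the loop itself only ever inspects midpoints), and a value outside the range of the list, or an empty list, yields a half-integer just past the nearest end.

```python
def bisection_search(values, target):
    """Return the index of target in the sorted list values.

    If target is absent, return the half-integer (L + R) / 2 lying between
    the two neighbouring indices. Assumption: values outside the list's
    range return -0.5 or len(values) - 0.5.
    """
    if not values or target < values[0]:
        return -0.5
    if target > values[-1]:
        return len(values) - 0.5

    left, right = 0, len(values) - 1
    # Check the endpoints first, since the loop never lands on them
    if values[left] == target:
        return left
    if values[right] == target:
        return right

    # Safeguard: stop as soon as L and R are adjacent
    while right - left > 1:
        mid = (left + right + 1) // 2
        if values[mid] == target:
            return mid
        if values[mid] < target:
            left = mid
        else:
            right = mid
    return (left + right) / 2

fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
print(bisection_search(fib, 13))
print(bisection_search(fib, 15))
```

Because a failed search returns a float and a successful one returns an int, the caller can tell the two cases apart with `isinstance(result, int)`. With repeated values, such as the two 1s here, the sketch returns whichever matching index it happens to reach first.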

Ask yourself, why does the above process only take $O(\log n)$ time?
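
A quick experiment can help build intuition before you work out the argument yourself. The sketch below counts how many times the halving loop runs on lists of different sizes and prints that count next to $\lceil \log_2 n \rceil$. The helper `count_steps` is hypothetical, and it searches for a value that is deliberately absent so the loop runs until $L$ and $R$ are adjacent.

```python
import math

def count_steps(n):
    """Count loop iterations when searching for an absent value among n sorted values."""
    values = range(0, 2 * n, 2)   # the even numbers 0, 2, ..., 2n - 2
    target = 1                    # odd, so it is never found
    left, right = 0, n - 1
    steps = 0
    while right - left > 1:
        mid = (left + right + 1) // 2
        if values[mid] < target:
            left = mid
        else:
            right = mid
        steps += 1
    return steps

for n in (10, 1_000, 1_000_000):
    print(n, count_steps(n), math.ceil(math.log2(n)))
```

Compare how quickly the first column grows with how slowly the second one does.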

## Problem 2

In [44]:
# Explore SeqIO & Biopython

In [45]:
#Conceptual refresher on universal hash families

## Problem 3

Your friend is running out of memory. What kind of data structure is your friend using? Can you think of any less memory-intensive alternatives?

## Problem 4

Now you get to look for an interesting dataset. Here are several good sites where you can find free data, but feel free to look elsewhere if you prefer:
1. [Kaggle](https://www.kaggle.com/)
2. [data.gov](https://data.gov/)
3. [data.ct.gov](https://data.ct.gov/)