# Assignment 2 Cheat Sheet

## Problem 1
### Extensible Markup Language (XML)

Our first problem requires us to parse a lot of data from an XML file. Let's start by importing a tool from the xml module and loading some dummy data so we can practice parsing XML.

In [3]:
import xml.etree.ElementTree as ET
from pprint import pprint as pp 
tree = ET.parse('./books.xml')
root = tree.getroot()

The first line in the cell above just imports the ElementTree module and saves it under the shortened alias, "ET". 

The second line imports the pprint ("pretty print") function from the pprint module under the alias pp. It simply prints things in a more readable format than Python's built-in print function.

The third line will look for a file called "books.xml" in the same folder where this cheatsheet2.ipynb file is stored. If it finds such a file, the ET module will use that file to construct a document object model (DOM) that mirrors the structure of the XML file inside Python.

Often, to navigate a DOM tree, it is easiest to start from the tree's root element and iteratively move to our current element's children until we find what we're looking for. Our fourth line just selects the root element of the DOM tree for this purpose. Let's poke around until we understand the basics of navigating our DOM tree. It helps to have the XML file open on the side as a roadmap.

In [4]:
pp(root)

<Element 'catalog' at 0x7fa9a26930b0>


As evidenced by the above cell, the root of our DOM tree corresponds to the catalog element. Referring to the XML file, we should expect that it has several child book elements, each with a unique id attribute. Each of those book elements has its own children corresponding to information related to that specific book. We can iterate over all the children of our root node (i.e. each book element) using a for loop:

In [5]:
for child in root:
    pp(child)

<Element 'book' at 0x7fa9a26cc450>
<Element 'book' at 0x7fa9a26cc6d0>
<Element 'book' at 0x7fa9a26cca90>
<Element 'book' at 0x7fa9a26ccc70>
<Element 'book' at 0x7fa9a26cd080>
<Element 'book' at 0x7fa9a26cd350>
<Element 'book' at 0x7fa9a26cd5d0>
<Element 'book' at 0x7fa9a26cd800>
<Element 'book' at 0x7fa9a26cda30>
<Element 'book' at 0x7fa9a26cdcb0>
<Element 'book' at 0x7fa9a26cdee0>
<Element 'book' at 0x7fa9a26ce1b0>
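The id attribute mentioned above lives on the element itself rather than in a child tag. Here is a minimal sketch of reading it. The inline `sample_xml` string is a hypothetical two-book excerpt in the same shape as books.xml, used so that the cell runs without the file:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book excerpt shaped like books.xml (not the real file).
sample_xml = """
<catalog>
  <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
  </book>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
  </book>
</catalog>
"""
sample_root = ET.fromstring(sample_xml)

# .attrib is a dict of all attributes on an element;
# .get() looks up a single attribute by name.
first_book_attributes = sample_root[0].attrib
book_ids = [book.get("id") for book in sample_root]

print(first_book_attributes)
print(book_ids)
```

With the real file, the same calls should work on `root` in place of `sample_root`.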


Notice that every time we have tried to print an element of the DOM directly, the output is just a description of an object of type "Element", with a name (e.g. "catalog" or "book"), together with a description of the location of that object in memory. If we would instead like to print out the XML string of the subtree of the DOM starting at a given element, we can use the ET.tostring() function to get that text:

In [6]:
book1 = root[0]
book1_xml = ET.tostring(book1)
print(book1_xml)

b'<book id="bk102">\n      <author>Ralls, Kim</author>\n      <title>Midnight Rain</title>\n      <genre>Fantasy</genre>\n      <price>5.95</price>\n      <publish_date>2000-12-16</publish_date>\n      <description>A former architect battles corporate zombies, \n      an evil sorceress, and her own childhood to become queen \n      of the world.</description>\n   </book>\n   '


Why does this look so funky? Observe that there is a "b" before the quotation marks in the output of the cell above. This indicates that book1_xml is not a regular Python string, but is in fact a "bytestring". We can confirm the type of book1_xml with the built-in Python function "type".

In [7]:
type(book1_xml)

bytes

Although Python's print function does an okay job at printing bytestrings, and pp does just a little bit better, the result will be much prettier if we decode the bytes into a proper string:

In [8]:
book1_string = book1_xml.decode('utf-8')
print(book1_string)

<book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   


Much better, although the end tag for the book element is misaligned. If the strange indentation bothers you, you can fix it by using the ET.indent() function:

In [9]:
ET.indent(book1)
print(ET.tostring(book1).decode('utf-8'))

<book id="bk102">
  <author>Ralls, Kim</author>
  <title>Midnight Rain</title>
  <genre>Fantasy</genre>
  <price>5.95</price>
  <publish_date>2000-12-16</publish_date>
  <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
</book>
   


This is sometimes easier to read with a bit of syntax highlighting. Here is the entry for another book in the file, bk101:
```xml
<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
  <description>An in-depth look at creating applications 
      with XML.</description>
</book>
```
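As an aside to the decoding step earlier, ET.tostring can also hand back a regular string directly if we pass `encoding="unicode"`, which skips the bytestring entirely. A small sketch, using a hypothetical one-line book element instead of the real file:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one book element.
sample_book = ET.fromstring(
    '<book id="bk102"><title>Midnight Rain</title></book>'
)

as_bytes = ET.tostring(sample_book)                     # default: bytes
as_text = ET.tostring(sample_book, encoding="unicode")  # str, no decode needed

print(type(as_bytes).__name__)
print(type(as_text).__name__)
print(as_text)
```

Either route gives the same text, so use whichever reads better to you.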


We can access an individual child of a given element in several different ways:

In [10]:
#Select the child by its index, similar to a python list
book1 = root[0]
pp(book1)

<Element 'book' at 0x7fa9a26cc450>


In [11]:
#Select the child by using its tag name
author1 = book1.find("author") 
pp(author1)

<Element 'author' at 0x7fa9a26cc360>


Note that if book1 had multiple author tags inside it, the find method would only select the first matching tag amongst its children.
We can also pull the text out from the inside of an element:

In [12]:
text1 = author1.text
pp(text1)

'Ralls, Kim'
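One thing to watch for: if no child matches, find returns None, and asking None for its .text raises an AttributeError. The findtext method is one way to guard against this, since it accepts a default value. A short sketch, where `sample_book` is a hypothetical element that deliberately lacks a price tag:

```python
import xml.etree.ElementTree as ET

# Hypothetical book element with no <price> child.
sample_book = ET.fromstring(
    "<book><author>Ralls, Kim</author><title>Midnight Rain</title></book>"
)

# find returns None when nothing matches.
missing = sample_book.find("price")
print(missing)

# findtext returns the matching child's text, or the default if absent.
price_text = sample_book.findtext("price", default="unknown")
author_text = sample_book.findtext("author", default="unknown")
print(price_text)
print(author_text)
```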


We can directly select grandchildren, great-grandchildren, etc. of a given element by specifying the path of that element relative to our current element:

In [13]:
title1 = root.find('book/title')  # finds the first title element inside a book element inside root
pp(title1.text)

'Midnight Rain'


Suppose we'd like to get all the matching children from a search, rather than just the first one. We simply use the findall method instead:

In [14]:
books = root.findall('book')
pp(books)

[<Element 'book' at 0x7fa9a26cc450>,
 <Element 'book' at 0x7fa9a26cc6d0>,
 <Element 'book' at 0x7fa9a26cca90>,
 <Element 'book' at 0x7fa9a26ccc70>,
 <Element 'book' at 0x7fa9a26cd080>,
 <Element 'book' at 0x7fa9a26cd350>,
 <Element 'book' at 0x7fa9a26cd5d0>,
 <Element 'book' at 0x7fa9a26cd800>,
 <Element 'book' at 0x7fa9a26cda30>,
 <Element 'book' at 0x7fa9a26cdcb0>,
 <Element 'book' at 0x7fa9a26cdee0>,
 <Element 'book' at 0x7fa9a26ce1b0>]


In just the same fashion, we can get all the descendants of an element with a given relationship to the current element using a path as our search query:

In [15]:
titles = root.findall('book/title')
for title in titles:
    pp(title.text)

'Midnight Rain'
"XML Developer's Guide"
'Maeve Ascendant'
"Oberon's Legacy"
'The Sundered Grail'
'Lover Birds'
'Splish Splash'
'Creepy Crawlies'
'Paradox Lost'
'Microsoft .NET: The Programming Bible'
'MSXML3: A Comprehensive Guide'
'Visual Studio 7: A Comprehensive Guide'


If our DOM is very large and complicated, it may not be practical to manually search through the whole tree. In such cases, it is useful to recursively search through all the subtrees under an element. This is handled automatically by the iter method, which will search for matches starting from the element itself and continuing through its children, grandchildren, and so on:

In [16]:
for price in root.iter('price'):
    pp(price.text)

'5.95'
'44.95'
'5.95'
'5.95'
'5.95'
'4.95'
'4.95'
'4.95'
'6.95'
'36.95'
'36.95'
'49.95'


Now that we know how to extract data from an XML file, we can load this information into a pandas dataframe for analysis:

In [17]:
import pandas as pd

In [18]:
book_data = []

for book in root:
    book_dict = {
        'title': book.find('title').text,
        'author': book.find('author').text,
        'price': float(book.find('price').text)
    }
    book_data.append(book_dict)

book_df = pd.DataFrame(book_data)

In [19]:
book_df.head()

|   | title                 | author               | price |
|---|-----------------------|----------------------|-------|
| 0 | Midnight Rain         | Ralls, Kim           | 5.95  |
| 1 | XML Developer's Guide | Gambardella, Matthew | 44.95 |
| 2 | Maeve Ascendant       | Corets, Eva          | 5.95  |
| 3 | Oberon's Legacy       | Corets, Eva          | 5.95  |
| 4 | The Sundered Grail    | Corets, Eva          | 5.95  |


### Bisection (Binary) Search Review

You are also tasked with writing an algorithm that searches for a given value in a sorted list in $O(\log n)$ time by using bisection. Let us quickly review how this algorithm works to help you on your way. 

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |

Suppose we have the above ordered list of natural numbers, and we'd like to find the location of the value 13. We can start by specifying the outer left and right indices, $L$ and $R$. Now take the midpoint between those two indices, $M = \big[\frac{L + R}2\big]$. These square brackets indicate that you will have to round $M$ to a whole number if $(L+R)$ is odd; in the walkthrough below, we round up.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers | L |   |   |   |   | M  |    |    |    |    | R  |

As shown above, we calculate that $M=5$. We now compare the value at this midpoint, $8$, against our search value, $13$. Clearly $8 < 13$. Since the list is sorted, we know that the value at the midpoint and all the values to its left are smaller than $13$. Therefore, we exclude these values from our future search by moving our left index $L$ to our current midpoint, $M$. Notice that we have just cut our search space **in half**.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |    |    | M   |    | R  |

Repeating the process above, we set the new $M$ to $\big[\frac{L+R}2\big]=8$, rounding as necessary. The value at index $8$ is $34$, which is greater than our search value $13$. Again, since the list is sorted, we know all values to the right of $M$ are greater than $13$, so we will exclude them from our future search by moving $R$ to our current $M$.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |    |  M  | R   |    |  |

Again, we repeat from above. We set the new $M$ to $\big[\frac{L+R}2\big]=7$. The value at index $M=7$ is $21$, which is greater than $13$, so we exclude all values to the right of $M$ by moving $R$ to the current $M$.

| index   | 0 | 1 | 2 | 3 | 4 | 5 | 6  | 7  | 8  | 9  | 10 |
|---------|---|---|---|---|---|---|----|----|----|----|----|
| value   | 1 | 1 | 2 | 3 | 5 | 8 | 13 | 21 | 34 | 55 | 89 |
| markers |   |   |   |   |   | L  |  M  |  R |    |    |  |

Finally, we find that $M = \big[\frac{(L+R)}2\big]=6$. The value at the index $6$ is $13$, which matches our search value, so our search is done, and our algorithm should output the index $6$.

What should we do if the search value is nowhere to be found in the list? For example, if we had tried to find $15$ in the list above, the steps described so far would result in a never-ending loop, so we must add a safeguard that terminates the algorithm once $L$ and $R$ are adjacent. For our purposes, it may be useful for the algorithm to output the final value of $\frac{L+R}2$ (which is not an integer) to indicate that our search was unsuccessful; this value also divides the list into the values below and above our search value.
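To make the walkthrough concrete, here is one possible sketch of it in Python. This is only one way to write it, not necessarily the form the assignment expects. The function name `bisection_search` is made up here. Checking both endpoints before the loop and returning a half-integer off either end for out-of-range targets are assumptions added on top of the steps above, so that a value sitting at index $L$ or $R$ is never missed:

```python
def bisection_search(values, target):
    """Return the index of target in the sorted list values.

    If target is absent, return a half-integer that sits between the
    indices of its would-be neighbours (e.g. 6.5), or None for an empty list.
    """
    if not values:
        return None
    left, right = 0, len(values) - 1

    # Assumption: targets outside the list's range fall off one of the ends.
    if target < values[left]:
        return left - 0.5
    if target > values[right]:
        return right + 0.5

    # Assumption: check the endpoints once, so the loop only inspects midpoints.
    if values[left] == target:
        return left
    if values[right] == target:
        return right

    # Safeguard: stop once L and R are adjacent, since nothing lies between them.
    while right - left > 1:
        mid = (left + right + 1) // 2  # midpoint, rounded up as in the tables
        if values[mid] == target:
            return mid
        if values[mid] < target:
            left = mid   # discard everything at or left of the midpoint
        else:
            right = mid  # discard everything at or right of the midpoint
    return (left + right) / 2


fibonacci = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
print(bisection_search(fibonacci, 13))  # 6, visiting M = 5, 8, 7, 6 as above
print(bisection_search(fibonacci, 15))  # 6.5, i.e. between indices 6 and 7
```

Each pass through the loop discards about half of the remaining indices, which is worth keeping in mind for the question that follows.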

Ask yourself, why does the above process only take $O(\log n)$ time?

## Problem 2

### Biopython and SeqIO

For this problem, you're going to need access to the Biopython library. We'll start by installing the library with conda (the leading `!` runs the command in the terminal rather than in Python), and then import some of its tools into the project:

In [20]:
! conda install -c conda-forge biopython -y

Collecting package metadata (current_repodata.json): done
Solving environment: / 
  - anaconda/linux-64::openssl-1.1.1q-h7f8727e_0
  - defaults/linux-64::openssl-1.1.1q-h7f8727edone


  current version: 4.12.0
  latest version: 4.14.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/max/.conda/envs/py310

  added / updated specs:
    - biopython


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    biopython-1.79             |  py310h6acc77f_1         3.0 MB  conda-forge
    ca-certificates-2022.9.14  |       ha878542_0         152 KB  conda-forge
    certifi-2022.9.14          |     pyhd8ed1ab_0         156 KB  conda-forge
    python_abi-3.10            |          2_cp310           4 KB  conda-forge
    ------------------------------------------------------------
                                           Tota

In [52]:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
import re


Now that we have the SeqIO module ready, let's use it to read some data into Python. Here we read in a simple data set containing a Coronavirus spike protein trimer sequence from [this site](https://www.rcsb.org/structure/3JCL).

In [22]:
proteins = SeqIO.parse("./query_sequences.fasta", "fasta")

Let's take a look at the output of this parsing function, how to navigate it, and how to extract the info from it that we might need.

In [23]:
print(proteins)

<Bio.SeqIO.FastaIO.FastaIterator object at 0x7fa9d8587490>


From this print statement, we see that the proteins object is some kind of iterator. This suggests that we are able to iterate over it using a for-loop, or to load the contents of the iterator into a list, as shown below:

In [24]:
# list of all elements iterated over by proteins
protein_list = list(proteins)

In [25]:
print(protein_list)

[SeqRecord(seq=Seq('YIGDFRCIQLVNSNGANVSAPSISTETVEVSQGLGTYYVLDRVYLNATLLLTGY...FEK'), id='3JCL_1|Chains', name='3JCL_1|Chains', description='3JCL_1|Chains A, B, C|Spike glycoprotein|Murine hepatitis virus strain A59 (11142)', dbxrefs=[])]
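One caveat worth knowing: an iterator can only be walked through once. After we have loaded its contents into a list, the iterator itself is used up, so a second pass over it yields nothing. The sketch below shows this with a plain Python generator standing in for the object returned by SeqIO.parse (the `fake_records` function is hypothetical, so that the cell runs without Biopython or the FASTA file):

```python
def fake_records():
    """Hypothetical stand-in for the iterator returned by SeqIO.parse."""
    yield "record 1"
    yield "record 2"


records = fake_records()
first_pass = list(records)   # consumes the iterator
second_pass = list(records)  # nothing left to read

print(first_pass)
print(second_pass)
```

This is why it is convenient to keep working with the list from here on, or to call the parsing function again if a fresh iterator is needed.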


Since our FASTA file only encoded a single protein, this list has only one element. Let's take a look at it:

In [26]:
protein = protein_list[0]
print(protein)

ID: 3JCL_1|Chains
Name: 3JCL_1|Chains
Description: 3JCL_1|Chains A, B, C|Spike glycoprotein|Murine hepatitis virus strain A59 (11142)
Number of features: 0
Seq('YIGDFRCIQLVNSNGANVSAPSISTETVEVSQGLGTYYVLDRVYLNATLLLTGY...FEK')


As we see above, this protein object has several attributes we might care about: ID, Name, Description, and Sequence. We can access each of these attributes individually:

In [27]:
print(f"Protein ID: {protein.id}")

Protein ID: 3JCL_1|Chains


In [28]:
print(f"Protein Name: {protein.name}")

Protein Name: 3JCL_1|Chains


In [29]:
print(f"Protein Description: {protein.description}")

Protein Description: 3JCL_1|Chains A, B, C|Spike glycoprotein|Murine hepatitis virus strain A59 (11142)


In [30]:
print(f"Protein Sequence: {protein.seq}")

Protein Sequence: YIGDFRCIQLVNSNGANVSAPSISTETVEVSQGLGTYYVLDRVYLNATLLLTGYYPVDGSKFRNLALRGTNSVSLSWFQPPYLNQFNDGIFAKVQNLKTSTPSGATAYFPTIVIGSLFGYTSYTVVIEPYNGVIMASVCQYTICQLPYTDCKPNTNGNKLIGFWHTDVKPPICVLKRNFTLNVNADAFYFHFYQHGGTFYAYYADKPSATTFLFSVYIGDILTQYYVLPFICNPTAGSTFAPRYWVTPLVKRQYLFNFNQKGVITSAVDCASSYTSEIKCKTQSMLPSTGVYELSGYTVQPVGVVYRRVANLPACNIEEWLTARSVPSPLNWERKTFQNCNFNLSSLLRYVQAESLFCNNIDASKVYGRCFGSISVDKFAVPRSRQVDLQLGNSGFLQTANYKIDTAATSCQLHYTLPKNNVTINNHNPSSWNRRYGFNDAGVFGKNQHDVVYAQQCFTVRSSYCPCAQPDIVSPCTTQTKPKSAFVNVGDHCEGLGVLEDNCGNADPHKGCICANNSFIGWSHDTCLVNDRCQIFANILLNGINSGTTCSTDLQLPNTEVVTGICVKYDLYGITGQGVFKEVKADYYNSWQTLLYDVNGNLNGFRDLTTNKTYTIRSCYSGRVSAAFHKDAPEPALLYRNINCSYVFSNNISREENPLNYFDSYLGCVVNADNRTDEALPNCDLRMGAGLCVDYSKSRRAHSSVSTGYRLTTFEPYTPMLVNDSVQSVDGLYEMQIPTNFTIGHHEEFIQTRSPKVTIDCAAFVCGDNTACRQQLVEYGSFCVNVNAILNEVNNLLDNMQLQVASALMQGVTISSRLPDGISGPIDDINFSPLLGCIGSTCAEDGNGPSAIRGRSAIEDLLFDKVKLSDVGFVEAYNNCTGGQEVRDLLCVQSFNGIKVLPPVLSESQISGYTTGATAAAMFPPWSAAAGVPFSLSVQYRINGLGVTMNVLSENQKMIASAFNNALGAIQDGFDATNSALG

Each of the letters in the above sequence represents an amino acid in the spike protein. This is similar to the way in which each letter in a DNA sequence represents a single nucleotide. If you want to feed such a sequence into a hash function, you may need to encode it as a bytestring using the following method:

In [31]:
sequence_bytes = str(protein.seq).lower().encode('utf8')
print(sequence_bytes)

b'yigdfrciqlvnsnganvsapsistetvevsqglgtyyvldrvylnatllltgyypvdgskfrnlalrgtnsvslswfqppylnqfndgifakvqnlktstpsgatayfptivigslfgytsytvviepyngvimasvcqyticqlpytdckpntngnkligfwhtdvkppicvlkrnftlnvnadafyfhfyqhggtfyayyadkpsattflfsvyigdiltqyyvlpficnptagstfaprywvtplvkrqylfnfnqkgvitsavdcassytseikcktqsmlpstgvyelsgytvqpvgvvyrrvanlpacnieewltarsvpsplnwerktfqncnfnlssllryvqaeslfcnnidaskvygrcfgsisvdkfavprsrqvdlqlgnsgflqtanykidtaatscqlhytlpknnvtinnhnpsswnrrygfndagvfgknqhdvvyaqqcftvrssycpcaqpdivspcttqtkpksafvnvgdhceglgvledncgnadphkgcicannsfigwshdtclvndrcqifanillnginsgttcstdlqlpntevvtgicvkydlygitgqgvfkevkadyynswqtllydvngnlngfrdlttnktytirscysgrvsaafhkdapepallyrnincsyvfsnnisreenplnyfdsylgcvvnadnrtdealpncdlrmgaglcvdysksrrahssvstgyrlttfepytpmlvndsvqsvdglyemqiptnftighheefiqtrspkvtidcaafvcgdntacrqqlveygsfcvnvnailnevnnlldnmqlqvasalmqgvtissrlpdgisgpiddinfspllgcigstcaedgngpsairgrsaiedllfdkvklsdvgfveaynnctggqevrdllcvqsfngikvlppvlsesqisgyttgataaamfppwsaaagvpfslsvqyringlgvtmnvlsenqkmiasafnnalgaiqdgfdatnsalgkiqsvvnanaealnnl
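To see why the encoding step matters, here is a small sketch using Python's built-in hashlib module. Note that sha256 is only a stand-in chosen for illustration; the hash functions supplied with the assignment may behave differently. The variable names are made up, and the input is the first 15 letters of the sequence above:

```python
import hashlib

sequence = "YIGDFRCIQLVNSNG"  # first 15 residues of the sequence above
sequence_bytes = sequence.lower().encode("utf8")

# hashlib refuses regular strings; it wants bytes.
try:
    hashlib.sha256(sequence.lower())
    str_rejected = False
except TypeError:
    str_rejected = True
print(str_rejected)

# With bytes it works; here we turn the first 8 digest bytes into an integer.
digest = hashlib.sha256(sequence_bytes).digest()
hash_value = int.from_bytes(digest[:8], "big")
print(hash_value)
```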

We can likewise decode a bytestring back into a string, if we like:

In [32]:
sequence_string = sequence_bytes.decode('utf8').upper()
print(sequence_string)

YIGDFRCIQLVNSNGANVSAPSISTETVEVSQGLGTYYVLDRVYLNATLLLTGYYPVDGSKFRNLALRGTNSVSLSWFQPPYLNQFNDGIFAKVQNLKTSTPSGATAYFPTIVIGSLFGYTSYTVVIEPYNGVIMASVCQYTICQLPYTDCKPNTNGNKLIGFWHTDVKPPICVLKRNFTLNVNADAFYFHFYQHGGTFYAYYADKPSATTFLFSVYIGDILTQYYVLPFICNPTAGSTFAPRYWVTPLVKRQYLFNFNQKGVITSAVDCASSYTSEIKCKTQSMLPSTGVYELSGYTVQPVGVVYRRVANLPACNIEEWLTARSVPSPLNWERKTFQNCNFNLSSLLRYVQAESLFCNNIDASKVYGRCFGSISVDKFAVPRSRQVDLQLGNSGFLQTANYKIDTAATSCQLHYTLPKNNVTINNHNPSSWNRRYGFNDAGVFGKNQHDVVYAQQCFTVRSSYCPCAQPDIVSPCTTQTKPKSAFVNVGDHCEGLGVLEDNCGNADPHKGCICANNSFIGWSHDTCLVNDRCQIFANILLNGINSGTTCSTDLQLPNTEVVTGICVKYDLYGITGQGVFKEVKADYYNSWQTLLYDVNGNLNGFRDLTTNKTYTIRSCYSGRVSAAFHKDAPEPALLYRNINCSYVFSNNISREENPLNYFDSYLGCVVNADNRTDEALPNCDLRMGAGLCVDYSKSRRAHSSVSTGYRLTTFEPYTPMLVNDSVQSVDGLYEMQIPTNFTIGHHEEFIQTRSPKVTIDCAAFVCGDNTACRQQLVEYGSFCVNVNAILNEVNNLLDNMQLQVASALMQGVTISSRLPDGISGPIDDINFSPLLGCIGSTCAEDGNGPSAIRGRSAIEDLLFDKVKLSDVGFVEAYNNCTGGQEVRDLLCVQSFNGIKVLPPVLSESQISGYTTGATAAAMFPPWSAAAGVPFSLSVQYRINGLGVTMNVLSENQKMIASAFNNALGAIQDGFDATNSALGKIQSVVNANAEALNNLLN

Suppose we want to look at small chunks of this sequence, known as $k$-mers (chunks of length $k$). We can take "slices" of the sequence by specifying the start and end indices of the slice:

In [33]:
smallseq =  sequence_bytes[0:20] #all elements from 0 to 19
print(smallseq)

b'yigdfrciqlvnsnganvsa'


We will often want to scan across a large sequence to look at all its $k$-mers of a given size. This can be accomplished by taking slices in a loop:

In [35]:
k = 10
n = len(smallseq)

kmers = []
# slide a window of width k across the sequence; the last window starts at index n-k
for index in range(0, n-k+1):
    kmer = smallseq[index:index+k]
    kmers.append(kmer)

In [36]:
print(kmers)

[b'yigdfrciql', b'igdfrciqlv', b'gdfrciqlvn', b'dfrciqlvns', b'frciqlvnsn', b'rciqlvnsng', b'ciqlvnsnga', b'iqlvnsngan', b'qlvnsnganv', b'lvnsnganvs', b'vnsnganvsa']
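If we expect to do this often, the sliding-window loop can be wrapped in a small reusable function. This is just a sketch; the name `get_kmers` is made up here, and the function works on both strings and bytestrings because both support slicing:

```python
def get_kmers(sequence, k):
    """Return every length-k slice of sequence, in order of starting index."""
    # A sequence of length n has n-k+1 windows of width k
    # (and none at all if k is larger than n).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


smallseq = b'yigdfrciqlvnsnganvsa'
kmers = get_kmers(smallseq, 10)
print(len(kmers))
print(kmers[0], kmers[-1])
```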


If we would like to save these new sequences for later, we can use SeqIO to write them to a file:

In [57]:
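# wrap each k-mer in a SeqRecord so that SeqIO can write it
records = []
for index, kmer in enumerate(kmers):
    kmer_string = kmer.decode('utf8').upper()  # bytes -> uppercase string
    kmer_seq = Seq(kmer_string)
    # FASTA readers treat the first whitespace-delimited word of the header as the ID,
    # so we avoid spaces in the ID and leave the description empty
    record = SeqRecord(kmer_seq, id=f"record_{index + 1}", description="")
    records.append(record)

SeqIO.write(records, "./output.fasta", "fasta")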

11

### Hashing

This problem also asks us to approximate how many distinct elements there are in a list by using a family of hash functions. Recall that a hash function takes some key value (e.g., the bytestring representation of a $k$-mer) and outputs some numerical value in a given range.

For example, each hash function in the family Professor McDougal gave us in the assignment will take a $15$-mer string and output an integer value between $0$ and $34{,}359{,}738{,}367$ (that is, $2^{35}-1$), which he represents as a hexadecimal number, scale=0x07ffffffff. For our purposes, we don't need to understand the internal workings of these hash functions; we only need to assume that the output distributions of these hash functions are uniform on the range $[0, \textit{scale}-1]$.

Suppose that we have a set of distinct values $\{x_1,...,x_n\}$, and we'd like to approximate the size of this set, $n$, using a hash function, $h$. $h$ should then map our input set to output values $\{h_1,...,h_m\}$, where $m\leq n$. 

In principle, we're hoping that $h$ maps any two distinct input values to two distinct output values so that our input and output sets are the same size. In practice, $h$ may map two or more distinct inputs to the same output. We call such an event a "hash collision".

The more hash collisions we have, the farther $m$ will be from $n$, and thus the less accurate our approximation will be. Therefore, we want to generate as few collisions as possible. Generally speaking, the more possible output values that a hash function has, the less likely it is that a collision will occur. This is why we have set our scale value to such a large number.

Once we have our sample of uniformly distributed hash values, it's time to do a little statistics. We discussed in lecture that if $\{y_1,...,y_m\}$ are uniformly distributed across the unit interval of real numbers $[0,1]$, then the expected minimum value of this set is calculated as follows:

$\mathbb{E}[\min y_i] = \frac 1{m+1}$. 

Bonus: the math behind this fact is explained [here](https://danieltakeshi.github.io/2016/09/25/the-expectation-of-the-minimum-of-iid-uniform-random-variables/) for those of you who wish to know.
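If you would rather convince yourself numerically, a quick simulation does the trick. The sketch below draws $m$ uniform random numbers many times over and averages the minimum from each draw. The choices of `m = 9`, the number of trials, and the random seed are arbitrary and only there for illustration:

```python
import random

random.seed(0)   # arbitrary seed so the run is repeatable
m = 9            # number of uniform samples per trial
trials = 20_000  # number of repeated experiments

total = 0.0
for _ in range(trials):
    # smallest of m independent draws from the unit interval
    total += min(random.random() for _ in range(m))

average_min = total / trials
expected_min = 1 / (m + 1)

print(f"simulated: {average_min:.4f}")
print(f"predicted: {expected_min:.4f}")
```

The two numbers should agree closely, and the agreement should improve as the number of trials grows.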

Since our hash values $\{h_1,...,h_m\}$ are uniformly distributed across the interval $[0,\textit{scale}-1]$, we should then expect that the minimum hash value is computed as follows

$\mathbb{E}[\min h_i] = \frac {\textit{scale}-1}{m+1}$.

From this equation, we then solve for $m$ with a bit of algebra to get the following:

$m = \frac{\textit{scale}-1}{\mathbb E [\min h_i]} - 1$

Since we are only able to directly measure the actual value of $\min h_i$, instead of the expected value, we can only use the following approximation:

$m \approx \frac{\textit{scale}-1}{\min h_i} - 1$

To get a more accurate approximation of $m$, we can run this computation many times, each time using a different hash function, then take our final value of $m$ to be the median of these approximate values. This is why the professor gave us a whole family of hash functions.
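Putting the pieces together, the whole estimation procedure might look something like the sketch below. To be clear about the assumptions: the `make_hash` helper is a hypothetical stand-in that builds a hash family out of salted sha256, and it is not the family provided with the assignment. The test keys, the choice of 51 hash functions, and the guard against a minimum of zero are likewise made up for illustration:

```python
import hashlib
import statistics

SCALE = 0x07FFFFFFFF  # the scale constant quoted above


def make_hash(seed):
    """Hypothetical stand-in for one member of a hash family."""
    salt = seed.to_bytes(4, "big")

    def h(key):
        digest = hashlib.sha256(salt + key).digest()
        # squeeze the digest into the range [0, SCALE - 1]
        return int.from_bytes(digest[:8], "big") % SCALE

    return h


def estimate_distinct(keys, num_hashes=51):
    """Median over the family of (scale - 1) / min(h) - 1."""
    estimates = []
    for seed in range(num_hashes):
        h = make_hash(seed)
        min_hash = min(h(key) for key in keys)
        min_hash = max(min_hash, 1)  # assumption: avoid dividing by zero
        estimates.append((SCALE - 1) / min_hash - 1)
    return statistics.median(estimates)


# 1000 distinct keys, each repeated three times
keys = [f"kmer{i}".encode("utf8") for i in range(1000)] * 3
estimate = estimate_distinct(keys)
print(round(estimate))
```

Notice that repeated keys do not affect the minimum hash value, which is exactly why this trick counts distinct elements. The result is a rough estimate, so expect the right order of magnitude rather than the exact count.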

## Problem 3

Your friend is running out of memory. What kind of data structure is your friend using? Can you think of any less memory-intensive alternatives?

## Problem 4

Now you get to look for an interesting dataset. Here are several good sites where you can find free data, but feel free to look elsewhere if you prefer:
1. [Kaggle](https://www.kaggle.com/)
2. [data.gov](https://data.gov/)
3. [data.ct.gov](https://data.ct.gov/)