### CS661 Assign10

Name: Pranit Kumbhar

# Extracting HTML Data
The US Treasury Secretary data used in Assign9 came from the US government web page `Prior-Secretaries-US-Department-of-the-Treasury.html`. Search for the page, open it in a browser, save it to your local machine, then upload it to your Colab account.

The secretaries data can be found in an HTML `<table>` element.

Tabular data in an HTML page is often structured using the `<table>` HTML element.  The structure of a `<table>` is:

    <table>
      <thead>
        <th>column header</th>
        . . .
        <th>column header</th>
        <!-- also other table metadata -->
      </thead>
      <tbody>
        <tr> <!-- a row of the table -->
          <th>row header</th> <!-- present only in some tables; plays the same role as a <td> -->
          <td>table cell data</td> <!-- data cells, in left-to-right column order -->
          <td>table cell data</td>
          . . .
          <td>table cell data</td>
        </tr>
        <tr>
          <!-- another table row of <th> and <td> elements -->
        </tr>
      </tbody>
    </table>

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a convenient way to extract the data. Each row of the table holds the data for one US Secretary of the Treasury, and the rows are in historical order.
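
As a rough illustration of reading such a table row by row, here is a minimal sketch. The `SAMPLE_HTML` string and its placeholder names are invented for the example and are not the real Treasury markup; `table_rows` is a hypothetical helper, not part of BeautifulSoup.

```python
import bs4

# Hypothetical miniature table with placeholder values (not the real page).
SAMPLE_HTML = """
<table>
  <thead><tr><th>Name</th><th>Term</th><th>President</th></tr></thead>
  <tbody>
    <tr><th>Secretary A, State X</th><td>2001 - 2005</td><td>President P</td></tr>
    <tr><th>Secretary B, State Y</th><td>2005 - 2009</td><td>President Q</td></tr>
  </tbody>
</table>
"""


def table_rows(html):
    """Return (row_header, [cell, ...]) for every <tr> inside <tbody>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find("tbody").find_all("tr"):
        header = tr.find("th").get_text(strip=True)
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append((header, cells))
    return rows


rows = table_rows(SAMPLE_HTML)
for header, cells in rows:
    print(header, cells)
```

The sketch assumes every body row has a `<th>`; a real page may need a check for rows where `tr.find("th")` returns `None`.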

Write a notebook (or program) to extract the data from the page and write a tab-separated output file, named `./your_last_name-cs661-assign10-data.tsv`, that contains records of the form:

    secretary_name tab president_name tab secretary_home_state tab term \n
    
for every combination of secretary and president.
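
One way to read "every combination" is one output line per president a secretary served under. The sketch below builds such lines; `make_records` and the placeholder values are hypothetical and only illustrate the record layout.

```python
def make_records(secretary, state, presidents, terms):
    """Build one tab-separated line per (president, term) pair."""
    if len(presidents) != len(terms):
        raise ValueError("expected one term per president")
    return [
        "\t".join((secretary, president, state, term)) + "\n"
        for president, term in zip(presidents, terms)
    ]


# Placeholder values, not taken from the real page.
records = make_records(
    "Secretary A",
    "State X",
    ["President P", "President Q"],
    ["2001 - 2003", "2003 - 2005"],
)
for line in records:
    print(repr(line))
```

Each line carries exactly three tabs and a trailing newline, which is what `wc` later counts as one line.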

### Extract Hints
* To use BeautifulSoup with the `html.parser` module, `import bs4`, then code `bs = bs4.BeautifulSoup(fd, 'html.parser')`, where `fd` is an open file object and `bs` is the returned in-memory parse tree of the document
    * HTML tags are of type `bs4.Tag`, and the tag name is the `.name` attribute; e.g. `if not isinstance(tr, bs4.Tag) or tr.name != 'tr':`
    * `.find(tag-name)` returns the first descendant element with the named tag (or `None` if there is none); `.find_all(tag-name)` returns a list of all descendant elements with the named tag
    * The children of a tag are available through the attribute `.children`, which is an iterator
    * `.next_sibling` and `.previous_sibling` point to the next and previous tag or string content at the same level
    * `.next_siblings` and `.previous_siblings` are iterators over those siblings
    * `.next_element` and `.previous_element` point to the next and previous parsed object (a tag or a string) in a depth-first traversal of the document
    * The `.text` attribute of an element is a concatenation of all the character content of all its descendants
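
The navigation attributes listed above can be tried on a tiny fragment. This is only a sketch: the one-row `html` string is made up, and it deliberately contains no whitespace between tags so that every child of the `<tr>` is a `bs4.Tag`.

```python
import bs4

# Hypothetical one-row fragment with no whitespace between tags.
html = "<table><tbody><tr><th>H</th><td>a</td><td>b</td></tr></tbody></table>"
soup = bs4.BeautifulSoup(html, "html.parser")

tr = soup.find("tr")

# .children yields direct children; keep only real tags.
child_names = [c.name for c in tr.children if isinstance(c, bs4.Tag)]

first_td = tr.find("td")           # first matching descendant
all_tds = tr.find_all("td")        # list of all matching descendants

prev_name = first_td.previous_sibling.name   # the <th> before it
next_name = first_td.next_sibling.name       # the second <td>

# .next_element is whatever was parsed next: here the string inside the <td>.
next_el = first_td.next_element

row_text = tr.text                 # all character content, concatenated

print(child_names, prev_name, next_name, repr(next_el), row_text)
```

On a real page the markup usually has newlines between tags, so `.next_sibling` and `.children` often return whitespace strings; that is why the `isinstance(..., bs4.Tag)` check in the hint matters.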

Multiple presidents (and multiple terms) occur in a `<td>` element, nested within a paragraph `<p>`, separated by a break `<br/>`, as:

    <td>
    <p>John Adams<br />
    Jefferson</p>
    </td>

If `el` refers to the `<td>` element, then `el.text` is the string `\nJohn Adams\n\t\t\tJefferson\n`
  
* Number of terms will always be the same as the number of presidents
* `<td>` elements can be identified by their sequence within their `<tr>` parent; keep a sequence counter
* The string method `.strip()` will remove leading and trailing whitespace from a string
* `re.split(pat, str)` splits a string on a regular expression; `str.split(sep)` splits on a literal separator string
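
Putting `.strip()` and `re.split` together, a cell's text can be broken into one entry per line. The helper name `split_cell` is hypothetical, and the pattern assumes entries are separated by a newline with optional surrounding whitespace, as in the `el.text` example above.

```python
import re


def split_cell(text):
    """Split a cell's text into its non-empty entries, one per line."""
    # Trim the outer whitespace first so no empty entries appear at the ends,
    # then split on each newline together with the whitespace around it.
    parts = re.split(r"\s*\n\s*", text.strip())
    return [p for p in parts if p]


presidents = split_cell("\nJohn Adams\n\t\t\tJefferson\n")
print(presidents)
```

Because the pattern swallows the tabs that follow the newline, the resulting names need no further cleanup.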
  
If you add a code cell following your program and execute the OS command `wc`

    !wc ./name-cs661-assign10-data.tsv
  
the output should be close to

    106    1289    6725 ./name-cs661-assign10-data.tsv

meaning 106 lines, 1289 words, and 6725 bytes (the same as the character count for pure-ASCII text).
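
For checking the same three numbers from Python instead of the shell, a rough equivalent is sketched below. `wc_counts` is a hypothetical helper; it follows the default behaviour of `wc`, which counts newline characters, whitespace-separated words, and bytes.

```python
def wc_counts(text):
    """Return (lines, words, bytes) the way a plain `wc` would report them."""
    lines = text.count("\n")                 # wc counts newline characters
    words = len(text.split())                # whitespace-separated tokens
    nbytes = len(text.encode("utf-8"))       # bytes, not characters
    return lines, words, nbytes


counts = wc_counts("a b\tc\nd e\n")
print(counts)
```

A record that is missing its trailing `\n` does not add to the line count, so a low first number is a quick hint that newlines are being dropped when the file is written.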


In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


In [None]:
!ls -l drive/MyDrive/data/*

-rw------- 1 root root  10474 Nov  3 17:10  drive/MyDrive/data/cs661_assign8_PranitKumbhar.ipynb
-rw------- 1 root root  10590 Nov 10 22:02  drive/MyDrive/data/cs661_assign9_PranitKumbhar.ipynb
-rw------- 1 root root 196771 Nov 22 01:25 'drive/MyDrive/data/Prior Secretaries _ U.S. Department of the Treasury.html'
-rw------- 1 root root   6297 Nov  3 01:23  drive/MyDrive/data/ustreas-sec.csv


In [223]:
import re

import bs4

OUT_PATH = 'drive/MyDrive/data/kumbhar-cs661-assign10-data.tsv'


def split_cell(text):
  """Split a <td>'s text into its non-empty entries, one per line."""
  parts = re.split(r'\s*\n\s*', text.strip())
  return [part.strip(' ,') for part in parts if part]


def extract_html(soup, out_path=OUT_PATH):
  """Write one secretary/president/state/term record per line to out_path."""
  table = soup.find('table')

  with open(out_path, 'w', encoding='utf-8') as out:
    for row in table.tbody:
      # Skip whitespace strings and anything that is not a table row.
      if not isinstance(row, bs4.Tag) or row.name != 'tr':
        continue

      th = row.find('th')
      tds = row.find_all('td')
      if th is None or len(tds) < 2:
        continue

      # The row header reads "Name, State" or "Name, Suffix, State".
      sec_and_state = re.split(r',\s', th.text.strip())
      if len(sec_and_state) == 3:
        secretary = ', '.join(sec_and_state[0:2])
      else:
        secretary = sec_and_state[0]
      state = sec_and_state[-1] if len(sec_and_state) >= 2 else ''

      # The first <td> holds the terms, the second the presidents.
      terms = split_cell(tds[0].text)
      presidents = split_cell(tds[1].text)

      # One record per president; each record ends with a newline.
      for president, term in zip(presidents, terms):
        out.write('{}\t{}\t{}\t{}\n'.format(secretary, president, state, term))


fp = 'drive/MyDrive/data/Prior Secretaries _ U.S. Department of the Treasury.html'
with open(fp, 'r', encoding='utf-8') as fd:
  bs = bs4.BeautifulSoup(fd, 'html.parser')

extract_html(bs)




In [221]:
!wc ./kumbhar-cs661-assign10-data.tsv

  79 1184 6897 ./kumbhar-cs661-assign10-data.tsv
