# Processing HTML Exercise

### Prelude
Import pandas, numpy, and BeautifulSoup (from bs4)

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

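Open the file 'Congressional Biographical Directory_htm.htm' from the course materials in both a browser and a text editor such as Notepad.  In the browser, open the inspector and examine the structure of the file.  The cell below downloads the same file with requests so that it can be processed in Python.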

In [2]:
import requests
response=requests.get('http://mysfi.s3-website-us-east-1.amazonaws.com/python_for_data_analysis/Data%20Files/Congressional%20Biographical%20Directory_htm.htm')
response.content

b'\r\n<html>\r\n<head><title>Congressional Biographical Directory</title></head>\r\n<body background="paper1.gif" text="#000000">\r\n<table BORDER=1 CELLPADDING=0 CELLSPACING=0 WIDTH="100%">\r\n<tr>\r\n<td width=100% valign=TOP bgcolor=#990000><center><img src="topbanner.jpg" border=0></center></td>\r\n</table>\r\n\r\n\r\n<center>\r\n<br>\r\n<b><i>Click Member Name to view Biography</i></b>\r\n<br><br>\r\n<table border=1 cellspacing=2 cellpadding=3>\r\n<tr><th>Member Name</th><th>Birth-Death</th><th>Position</th>\r\n<th>Party</th><th>State</th><th>Congress<br>(Year)</th></tr>\r\n\r\n<tr><td><A HREF="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</A></td><td>1837-1920</td>\n<td>Representative</td><td>Democrat</td><td align="center">KY</td><td align="center">43<br>(1873-1874)</td></tr>\n<tr><td><A HREF="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</A></td><td>1816-1879</td>\n<td>Representative</td><td>Re

### Part 1: Process the File with Beautiful Soup

**1a** Create a BeautifulSoup object, called soup, that holds the parsed contents of the HTML file

In [5]:
soup = BeautifulSoup(response.content, 'html.parser')
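
When no parser is named, BeautifulSoup picks whichever one is installed and emits a warning, so results can differ between machines. A minimal sketch of naming the parser explicitly, using a made-up snippet (`sample_html` and the example.com link are illustrative, not the course file):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the downloaded page
sample_html = b"""
<html><head><title>Congressional Biographical Directory</title></head>
<body><table>
<tr><td><A HREF="http://example.com/a">ADAMS, George Madison</A></td></tr>
</table></body></html>
"""

# 'html.parser' ships with Python, so no extra install is needed
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag and attribute names are lowercased during parsing
title_text = sample_soup.title.string
first_href = sample_soup.find("a")["href"]
print(title_text, first_href)
```

Other parsers such as lxml can be swapped in by changing the second argument, provided they are installed.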

**1b** Find all of the links in the soup object

In [6]:
links=soup.find_all('a')
links[0:5]

[<a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035">ADAMS, George Madison</a>,
 <a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074">ALBERT, William Julian</a>,
 <a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077">ALBRIGHT, Charles</a>,
 <a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079">ALCORN, James Lusk</a>,
 <a href="http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160">ALLISON, William Boyd</a>]

**1c** Get all the URLs (the href attributes) from links as a list

In [14]:
urls=[link.get('href') for link in links]
urls[:5]

['http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035',
 'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074',
 'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077',
 'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079',
 'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160']
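
In this file every link carries an `href`, but that is not guaranteed in general: `link.get('href')` returns `None` for an `<a>` tag without one. A small sketch, on invented markup, of restricting the search to anchors that have the attribute:

```python
from bs4 import BeautifulSoup

# Invented markup: the middle anchor has no href
sample_html = """
<a href="http://example.com/A000035">ADAMS, George Madison</a>
<a name="top">anchor without a link</a>
<a href="http://example.com/A000074">ALBERT, William Julian</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .get() returns None when the attribute is missing
all_hrefs = [a.get("href") for a in sample_soup.find_all("a")]

# href=True keeps only the tags that have an href attribute
real_hrefs = [a["href"] for a in sample_soup.find_all("a", href=True)]

print(all_hrefs)
print(real_hrefs)
```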

**1d** Get all the actual content for each link as a list

In [16]:
content=[link.contents for link in links]
content[0]

['ADAMS, George Madison']
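
`link.contents` returns a list of child nodes, which is why each entry comes back wrapped in a list. As a sketch of an alternative (not what the exercise asks for), `get_text()` returns the text as a plain string; the one-link snippet here is made up for illustration:

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    '<a href="http://example.com/A000035">ADAMS, George Madison</a>',
    "html.parser",
)
link = sample_soup.find("a")

as_children = link.contents            # list of child nodes
as_string = link.get_text(strip=True)  # plain str, whitespace trimmed

print(as_children)
print(as_string)
```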

### Part 2: Loading the data into a dataframe

**2a** Create a dictionary with keys 'URL' and 'Name' whose values are urls and content (the lists you just created)

In [18]:
allData={'URL':urls,'Name':content}
allData

{'URL': ['http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000035',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000074',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000077',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000079',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000160',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000172',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000262',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000274',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000283',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000304',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000309',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000327',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=A000344',
  'http://bioguide.congress.gov/scripts/biodisplay.pl?index=B000117',
  'http://bio

**2b** Convert the dictionary into a pandas DataFrame

In [25]:
df=pd.DataFrame(allData)
df.head()

| | URL | Name |
|---|---|---|
| 0 | http://bioguide.congress.gov/scripts/biodispla... | [ADAMS, George Madison] |
| 1 | http://bioguide.congress.gov/scripts/biodispla... | [ALBERT, William Julian] |
| 2 | http://bioguide.congress.gov/scripts/biodispla... | [ALBRIGHT, Charles] |
| 3 | http://bioguide.congress.gov/scripts/biodispla... | [ALCORN, James Lusk] |
| 4 | http://bioguide.congress.gov/scripts/biodispla... | [ALLISON, William Boyd] |


**2c** Notice that the Name column actually holds one-item lists.  Extract the single item from each list so that the column contains plain strings instead of strings nested in lists.  Hint: use apply

In [26]:
def getFirst(x):
    return x[0]
df['Name']=df['Name'].apply(getFirst)
df['Name'].head()

0     ADAMS, George Madison
1    ALBERT, William Julian
2         ALBRIGHT, Charles
3        ALCORN, James Lusk
4     ALLISON, William Boyd
Name: Name, dtype: object
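
The same unwrapping can be done without a named helper function. A sketch on a toy frame (`toy` is a stand-in whose two rows are copied from the output above), assuming every cell holds a non-empty list:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": [["ADAMS, George Madison"], ["ALBERT, William Julian"]]}
)

# A lambda does the same job as getFirst
with_lambda = toy["Name"].apply(lambda items: items[0])

# The .str accessor can also index into list-valued cells
with_str = toy["Name"].str[0]

print(with_lambda.tolist())
print(with_str.tolist())
```

Both forms raise or return missing values on empty lists, so they are only safe when each cell is known to hold at least one item.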

**2d** Convert the name column to title case

In [28]:
df['Name']=df['Name'].str.title()
df.head()

| | URL | Name |
|---|---|---|
| 0 | http://bioguide.congress.gov/scripts/biodispla... | Adams, George Madison |
| 1 | http://bioguide.congress.gov/scripts/biodispla... | Albert, William Julian |
| 2 | http://bioguide.congress.gov/scripts/biodispla... | Albright, Charles |
| 3 | http://bioguide.congress.gov/scripts/biodispla... | Alcorn, James Lusk |
| 4 | http://bioguide.congress.gov/scripts/biodispla... | Allison, William Boyd |


### Part 3: Load the data using pandas read_html

**3a** Load the data from the URL below using read_html.  How many tables are there?

http://mysfi.s3-website-us-east-1.amazonaws.com/python_for_data_analysis/Data%20Files/Congressional%20Biographical%20Directory_htm.htm

In [4]:
allTables=pd.read_html('http://mysfi.s3-website-us-east-1.amazonaws.com/python_for_data_analysis/Data%20Files/Congressional%20Biographical%20Directory_htm.htm')
numTables=len(allTables)
numTables

1

**3b** Get the first dataframe returned by read_html in the above exercise and store it as df.


In [5]:
df=allTables[0]
df.head()

| | Member Name | Birth-Death | Position | Party | State | Congress(Year) |
|---|---|---|---|---|---|---|
| 0 | ADAMS, George Madison | 1837-1920 | Representative | Democrat | KY | 43(1873-1874) |
| 1 | ALBERT, William Julian | 1816-1879 | Representative | Republican | MD | 43(1873-1874) |
| 2 | ALBRIGHT, Charles | 1830-1880 | Representative | Republican | PA | 43(1873-1874) |
| 3 | ALCORN, James Lusk | 1816-1894 | Senator | Republican | MS | 43(1873-1874) |
| 4 | ALLISON, William Boyd | 1829-1908 | Senator | Republican | IA | 43(1873-1874) |
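
In the `read_html` result the `<br>` inside the last column is dropped, so the congress number and the years run together as `43(1873-1874)`. One possible way to separate them, sketched on two rows copied from the output above; the new column names (`Congress`, `StartYear`, `EndYear`) are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "Member Name": ["ADAMS, George Madison", "ALBERT, William Julian"],
    "Congress(Year)": ["43(1873-1874)", "43(1873-1874)"],
})

# Named groups become the column names of the extracted frame
pattern = r"(?P<Congress>\d+)\((?P<StartYear>\d{4})-(?P<EndYear>\d{4})\)"
parts = toy["Congress(Year)"].str.extract(pattern).astype(int)

# Attach the three numeric columns alongside the original ones
toy = toy.join(parts)
print(toy)
```

The `astype(int)` step assumes every row matches the pattern; a row that does not match would produce missing values and make the conversion fail.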


### If you have time

Using BeautifulSoup, extract the party from the HTML and add it as a column to the dataframe you built in Part 2 (the one with the URL and Name columns)

In [46]:
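# Keep rows with at least five child nodes: the header row and the member rows.
# Child 4 is the Party cell because a newline text node sits between the cells.
partySpots=[t.contents[4]  for t in soup.find_all('tr') if len(t.contents)>=5]
# Drop the header row's Party cell
noHeaderParty=partySpots[1:]
# The first child of each cell is its text
parties=[container.contents[0] for container in noHeaderParty]
df['Party']=parties
df.head()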

| | URL | Name | Party |
|---|---|---|---|
| 0 | http://bioguide.congress.gov/scripts/biodispla... | Adams, George Madison | Democrat |
| 1 | http://bioguide.congress.gov/scripts/biodispla... | Albert, William Julian | Republican |
| 2 | http://bioguide.congress.gov/scripts/biodispla... | Albright, Charles | Republican |
| 3 | http://bioguide.congress.gov/scripts/biodispla... | Alcorn, James Lusk | Republican |
| 4 | http://bioguide.congress.gov/scripts/biodispla... | Allison, William Boyd | Republican |


Display just the Democrats

In [48]:
dems=df[df['Party']=='Democrat']
dems.head()

| | URL | Name | Party |
|---|---|---|---|
| 0 | http://bioguide.congress.gov/scripts/biodispla... | Adams, George Madison | Democrat |
| 7 | http://bioguide.congress.gov/scripts/biodispla... | Archer, Stevenson | Democrat |
| 8 | http://bioguide.congress.gov/scripts/biodispla... | Armstrong, Moses Kimball | Democrat |
| 9 | http://bioguide.congress.gov/scripts/biodispla... | Arthur, William Evans | Democrat |
| 10 | http://bioguide.congress.gov/scripts/biodispla... | Ashe, Thomas Samuel | Democrat |


Which party had the majority?

In [49]:
df['Party'].value_counts()

Republican                265
Democrat                  122
Liberal Republican          6
Independent Republican      1
ME                          1
Unionist                    1
                            1
Name: Party, dtype: int64
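
The counts above include two odd entries, `ME` and an empty string, which suggests that `contents[4]` does not land on the Party cell in every row. One possible cause, which is an assumption since the affected rows are not shown here, is a row whose whitespace text nodes sit in different places. A sketch on invented rows (including the made-up member "DOE, Jane") of indexing only the `<td>` cells, which ignores any text nodes between them:

```python
from bs4 import BeautifulSoup

# Invented rows: the last one has no newline between cells, so
# contents[4] points at the State cell instead of the Party cell
sample_html = """<table>
<tr><th>Member Name</th><th>Birth-Death</th><th>Position</th>
<th>Party</th><th>State</th></tr>
<tr><td>ADAMS, George Madison</td><td>1837-1920</td>
<td>Representative</td><td>Democrat</td><td>KY</td></tr>
<tr><td>DOE, Jane</td><td>1800-1880</td><td>Representative</td><td>Republican</td><td>ME</td></tr>
</table>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_position = []  # what positional indexing into contents returns
by_cell = []      # what indexing the <td> cells returns
for row in sample_soup.find_all("tr"):
    cells = row.find_all("td")  # the header row has only <th>, so it is skipped
    if len(cells) >= 4:
        by_position.append(row.contents[4].get_text())
        by_cell.append(cells[3].get_text(strip=True))

print(by_position)
print(by_cell)
```

Counting `<td>` cells also removes the need to slice off the header row, since header rows contain `<th>` cells only.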

| | Member Name | Birth-Death | Position | Party | State | Congress(Year) |
|---|---|---|---|---|---|---|
| 0 | ADAMS, George Madison | 1837-1920 | Representative | Democrat | KY | 43(1873-1874) |
| 1 | ALBERT, William Julian | 1816-1879 | Representative | Republican | MD | 43(1873-1874) |
| 2 | ALBRIGHT, Charles | 1830-1880 | Representative | Republican | PA | 43(1873-1874) |
| 3 | ALCORN, James Lusk | 1816-1894 | Senator | Republican | MS | 43(1873-1874) |
| 4 | ALLISON, William Boyd | 1829-1908 | Senator | Republican | IA | 43(1873-1874) |
| ... | ... | ... | ... | ... | ... | ... |
| 392 | WOODFORD, Stewart Lyndon | 1835-1913 | Representative | Republican | NY | 43(1873-1874) |
| 393 | WOODWORTH, Laurin Dewey | 1837-1897 | Representative | Republican | OH | 43(1873-1874) |
| 394 | WRIGHT, George Grover | 1820-1896 | Senator | Republican | IA | 43(1873-1874) |
| 395 | YOUNG, John Duncan | 1823-1910 | Representative | Democrat | KY | 43(1873-1874) |
