# Webscraping - Lecture Code

## The Task:

Gather information about Illinois' elected state legislators from this page: http://www.ilga.gov/senate/default.asp

## The Tools:

1. [Requests](http://docs.python-requests.org/en/latest/user/quickstart/)
2. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [1]:
# import required modules
import requests
from bs4 import BeautifulSoup

## Step 1: Make a GET Request and Read in HTML

We use the `requests` library to:
1. make a GET request to the page
2. read in the HTML of the page

In [81]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# read the content of the server’s response
src = req.text
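
The cell above assumes the request succeeds. As a minimal sketch (the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the lecture code), the request could be wrapped so that a failed request raises an error instead of quietly handing an error page to the parser:

```python
import requests

def fetch_html(url, timeout=10):
    """Hypothetical helper: return the HTML at `url` as a string."""
    # give up if the server does not answer within `timeout` seconds
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError on a 4xx or 5xx status code
    response.raise_for_status()
    return response.text

# Usage (needs network access):
# src = fetch_html('http://www.ilga.gov/senate/default.asp')
```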

## Step 2: Soup it

Now we use the `BeautifulSoup` constructor to parse the response into an HTML tree. This returns an object (called a **soup object**) that contains all of the HTML in the original document.
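
The lecture code calls `BeautifulSoup(src)` without naming a parser, so Beautiful Soup picks the best one installed and warns that it had to guess. Here is a minimal sketch of naming the parser explicitly, using a made-up HTML string (`toy_html` and `toy_soup` are illustrative names) and Python's built-in `"html.parser"`. Other parsers, such as `lxml`, may build a slightly different tree from messy HTML, so results on a real page can vary by parser.

```python
from bs4 import BeautifulSoup

# made-up document standing in for a downloaded page
toy_html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# naming the parser keeps the parse from depending on which parsers are installed
toy_soup = BeautifulSoup(toy_html, "html.parser")

print(type(toy_soup).__name__)   # BeautifulSoup
print(toy_soup.title.string)     # Demo page
print(toy_soup.p.text)           # Hello
```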

In [135]:
# parse the response into an HTML tree
soup = BeautifulSoup(src)
# take a look
print(soup.prettify()[:1000])

<html lang="en">
 <head>
  <title>
   Illinois General Assembly - Senate Members
  </title>
  <link href="/style/lis.css" rel="stylesheet" type="text/css"/>
  <link href="/style/print.css" media="print" rel="stylesheet" type="text/css"/>
  <link href="http://info.er.usgs.gov/public/gils/gilsexec.html" rel="GILS"/>
  <link href="/LISlogo1.ico" rel="Shortcut Icon"/>
  <script language="JavaScript" type="text/javascript">
   <!--

if(window.event + "" == "undefined") event = null;
function HM_f_PopUp(){return false};
function HM_f_PopDown(){return false};
popUp = HM_f_PopUp;
popDown = HM_f_PopDown;

//-->
  </script>
  <!--
    option explicit
  -->
  <meta content='(PICS-1.1 "http://www.weburbia.com/safe/ratings.htm" l r (s 0))' http-equiv="PICS-Label"/>
  <meta content="Government" name="classification"/>
  <meta content="Global" name="distribution"/>
  <meta content="General" name="rating"/>
  <meta content="IL" name="contactState"/>
  <meta content="Illinois General Assembly" name="

## Step 3: Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other webscraping tools, Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors


Let's search first for **HTML tags**. 

The `find_all` method searches the `soup` tree for every element with a particular HTML tag and returns all of them.

What does the example below do?

In [136]:
# find all elements with a certain tag

# soup.find_all("a")

**NB**: Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. These two lines of code are equivalent:

In [137]:
# soup.find_all("a")
# soup("a")
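
To see the shortcut in action without hitting the network, here is a small sketch on a hand-written HTML snippet (the snippet and the names `demo_html` and `demo_soup` are made up for illustration):

```python
from bs4 import BeautifulSoup

# hypothetical snippet standing in for a real page
demo_html = """
<p>Intro</p>
<a href="/senate/">Senate</a>
<a href="/house/">House</a>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# both calls return the same list of <a> tags
links_long = demo_soup.find_all("a")
links_short = demo_soup("a")

print(len(links_long))            # 2
print(links_long == links_short)  # True
```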

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get a lot of stuff, much of which you don't want. What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes?

We can do this by passing an additional argument to `find_all`.

In the example below, we find all the `a` tags and keep only those with `class="sidemenu"`. (The keyword argument is spelled `class_` because `class` is a reserved word in Python.)

In [138]:
# Get only the 'a' tags in 'sidemenu' class
soup("a", class_="sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]

Often a more concise way to find things on a page is by **CSS selector**. For this we use a different method, `select()`. Pass it a string containing a CSS selector, and it returns every element that matches.

For the example above, the CSS selector `a.sidemenu` returns all `a` tags with class `sidemenu`.

In [139]:
# get elements with "a.sidemenu" CSS Selector.
soup.select("a.sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senate/audvid.asp">  Live Audio/Video  </a>]
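
As a sketch of how the two search styles line up, the snippet below runs both on hand-written markup modeled on the output above (the markup, the `footer` class, and the names `menu_html` and `menu_soup` are assumptions for illustration). It also shows `select_one`, which returns the first match instead of a list:

```python
from bs4 import BeautifulSoup

# hypothetical menu markup modeled on the output above
menu_html = """
<div id="menu">
  <a class="sidemenu" href="/senate/default.asp">Members</a>
  <a class="sidemenu" href="/senate/rules.asp">Rules</a>
  <a class="footer" href="/sitemap.asp">Site Map</a>
</div>
"""
menu_soup = BeautifulSoup(menu_html, "html.parser")

# the same two links, found two different ways
by_class = menu_soup.find_all("a", class_="sidemenu")
by_selector = menu_soup.select("a.sidemenu")
print(by_class == by_selector)   # True

# select_one returns the first match (or None) instead of a list
first = menu_soup.select_one("#menu a.sidemenu")
print(first["href"])             # /senate/default.asp
```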

## Step 4: Get Attributes and Text of Elements

Once we have identified elements, we want to access the information in them. Usually this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` attribute of a `Tag` object:

In [140]:
# this is a list
soup.select("a.sidemenu")

# we first want to get an individual tag object
first_link = soup.select("a.sidemenu")[0]

# check out its class
type(first_link)

bs4.element.Tag

It's a tag! Which means it has a `text` member:

In [141]:
print(first_link.text)

  Members  
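
Notice the spaces around the word in the output: `text` returns the whitespace exactly as it appears in the HTML. A minimal sketch of two ways to trim it, using a one-line copy of the first sidemenu link (the names `link_html` and `link` are illustrative):

```python
from bs4 import BeautifulSoup

# one-line copy of the first sidemenu link shown above
link_html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a>'
link = BeautifulSoup(link_html, "html.parser").a

print(repr(link.text))                  # '  Members  '
print(repr(link.text.strip()))          # 'Members'
print(repr(link.get_text(strip=True)))  # 'Members'
```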


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

You can access a tag’s attributes by treating the tag like a dictionary:

In [142]:
print(first_link['href'])

/senate/default.asp


In [143]:
# Get just the href (url) attribute from the first 10 links
for link in soup('a')[:10]:
    print(link['href'])

/default.asp
/
/legislation/
/senate/
/house/
/mylegislation/
/sitemap.asp
/senate/default.asp
/senate/committees/default.asp
/senate/schedules/default.asp
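
These `href` values are relative: they only make sense together with the address of the page they came from. A minimal sketch of turning them into full URLs with the standard library's `urljoin` (the variable names are illustrative; the two hrefs are copied from outputs in this lecture):

```python
from urllib.parse import urljoin

# the page the links were scraped from
page_url = "http://www.ilga.gov/senate/default.asp"

# one root-relative href and one href relative to the current folder
hrefs = ["/senate/committees/default.asp", "SenatorBills.asp?MemberID=2130"]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)
# http://www.ilga.gov/senate/committees/default.asp
# http://www.ilga.gov/senate/SenatorBills.asp?MemberID=2130
```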


## Let's Do This.

Believe it or not, that's all you need to scrape a website. Let's apply these skills to scrape http://www.ilga.gov/senate/default.asp

First, make the GET request and parse the response:

In [144]:
# make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp')
# read the content of the server’s response
src = req.text
# soup it
soup = BeautifulSoup(src)

The senators are listed in an HTML table, so let's get a list of the rows in that table. Rows are identified by the `tr` tag.

In [153]:
# get all tr elements
rows = soup.find_all("tr")
len(rows)

81

But remember, `find_all` gets every element with the `tr` tag, including rows we don't want. We can use a more specific CSS selector to get only the rows we want.
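
To see why a selector like `tr tr tr` narrows things down, here is a sketch on a made-up page of nested layout tables (the markup and the names `nested_html` and `nested_soup` are assumptions for illustration; the real page's structure may differ). The descendant selector `tr tr tr` only matches a row that sits inside a row that itself sits inside another row:

```python
from bs4 import BeautifulSoup

# hypothetical layout: a data table nested two tables deep
nested_html = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td class="detail">Pamela J. Althoff</td><td class="detail">32</td></tr>
      <tr><td class="detail">Jason A. Barickman</td><td class="detail">53</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
nested_soup = BeautifulSoup(nested_html, "html.parser")

# every row, including the two layout rows that wrap the data table
all_rows = nested_soup.find_all("tr")
# only rows that have two levels of rows above them
inner_rows = nested_soup.select("tr tr tr")

print(len(all_rows))    # 4
print(len(inner_rows))  # 2
```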

In [147]:
# select every element that matches the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')
print(rows[2].prettify())

<tr>
 <td bgcolor="white" class="detail" width="40%">
  <a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">
   Pamela J. Althoff
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenatorBills.asp?MemberID=2130">
   Bills
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenCommittees.asp?MemberID=2130">
   Committees
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  32
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  R
 </td>
</tr>



We can use the `select` method on anything. Let's say we want to find everything with the CSS selector `td.detail` in an item of the list we created above.

In [150]:
# select every element that matches the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# Look at just the third item
row = rows[2]

# select only those 'td' tags with class 'detail'
detailCells = row.select('td.detail')

detailCells

[<td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=99&amp;MemberID=2130">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?MemberID=2130">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?MemberID=2130">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. Remember, to get the text of an HTML element, use the `text` member.

In [29]:
# select every element that matches the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# Look at just the third item
row = rows[2]

# select only those 'td' tags with class 'detail'
detailCells = row.select('td.detail')

# Keep only the text in each of those cells
rowData = [cell.text for cell in detailCells]

# print each piece of text on its own line
for text in rowData:
    print(text)

Pamela J. Althoff
Bills
Committees
32
R


Now we can combine the Beautiful Soup tools with our basic Python skills to scrape an entire web page.

In [151]:
# select every element that matches the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# Look at just the third item
row = rows[2]

# select only those 'td' tags with class 'detail'
detailCells = row.select('td.detail')

# Keep only the text in each of those cells
rowData = [cell.text for cell in detailCells]

# check em out
print(rowData[0]) # Name
print(rowData[3]) # district
print(rowData[4]) # party

Pamela J. Althoff
32
R


Let's use a for loop to get 'em all!

In [152]:
# Create empty list to store our data
members = []

# select every element that matches the 'tr tr tr' CSS selector
rows = soup.select('tr tr tr')

# loop through all rows
for row in rows:
    # select only those 'td' tags with class 'detail'
    detailCells = row.select('td.detail')
    
    # get rid of junk rows
    if len(detailCells) != 5:
        continue
        
    # Keep only the text in each of those cells
    rowData = [cell.text for cell in detailCells]
    
    # Collect information
    name = rowData[0]
    district = int(rowData[3])
    party = rowData[4]
    
    # Store in a tuple
    tup = (name,district,party)
    
    # Append to list
    members.append(tup)
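
Once `members` is filled, a natural next step is to save it. Here is a minimal sketch, assuming a list of `(name, district, party)` tuples like the one built above. The three sample rows use names that appear in this lecture's output, and writing to an in-memory `io.StringIO` buffer rather than a file on disk is a choice made to keep the example self-contained:

```python
import csv
import io

# sample rows in the same (name, district, party) shape as `members`
sample_members = [
    ("Pamela J. Althoff", 32, "R"),
    ("Jason A. Barickman", 53, "R"),
    ("Daniel Biss", 9, "D"),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "district", "party"])  # header row
writer.writerows(sample_members)                # one line per senator

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer could be swapped for something like `open("members.csv", "w", newline="")`, where the filename is only an example.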

### Challenge 1

Make a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

In [34]:
def get_members(url):
    # YOUR CODE HERE
    pass

In [36]:
# Test your code!

# get_members('http://www.ilga.gov/senate/default.asp?GA=98')

[(u'Pamela J. Althoff', 32, u'R'),
 (u'Jason A. Barickman', 53, u'R'),
 (u'Scott M Bennett', 52, u'D'),
 (u'Jennifer Bertino-Tarrant', 49, u'D'),
 (u'Daniel Biss', 9, u'D'),
 (u'Tim Bivins', 45, u'R'),
 (u'William E. Brady', 44, u'R'),
 (u'Melinda Bush', 31, u'D'),
 (u'James F. Clayborne, Jr.', 57, u'D'),
 (u'Jacqueline Y. Collins', 16, u'D'),
 (u'Michael Connelly', 21, u'R'),
 (u'John J. Cullerton', 6, u'D'),
 (u'Thomas Cullerton', 23, u'D'),
 (u'Bill Cunningham', 18, u'D'),
 (u'William Delgado', 2, u'D'),
 (u'Kirk W. Dillard', 24, u'R'),
 (u'Dan Duffy', 26, u'R'),
 (u'Gary Forby', 59, u'D'),
 (u'Michael W. Frerichs', 52, u'D'),
 (u'William R. Haine', 56, u'D'),
 (u'Don Harmon', 39, u'D'),
 (u'Napoleon Harris, III', 15, u'D'),
 (u'Michael E. Hastings', 19, u'D'),
 (u'Linda Holmes', 42, u'D'),
 (u'Mattie Hunter', 3, u'D'),
 (u'Toi W. Hutchinson', 40, u'D'),
 (u'Mike Jacobs', 36, u'D'),
 (u'Emil Jones, III', 14, u'D'),
 (u'David Koehler', 46, u'D'),
 (u'Dan Kotowski', 28, u'D'),
 (u'Dar

### Challenge 2

Given the `senateMembers` list, create a new dictionary `members_dict` whose keys are the district numbers (e.g. `32`) and whose values are the entire tuples as returned by `get_members()` (name, district number, party, url). We can do this because the district number is a unique identifier for each senator.

Calling `members_dict[32]`, for example, should return the 4-tuple:

```
(u'Pamela J. Althoff',
 32,
 u'R',
 'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True')
```

In [155]:
# YOUR CODE HERE