# Navigate and collect large websites with Selenium
In this notebook, we will review basic Selenium and then go into more detail about how to build a flexible scraper that can navigate through a large number of pages on a given website.

### Starting up

We start by setting up our webdriver and opening the GPO's page once more.

In [83]:
import pandas
import urllib
from selenium import webdriver

gpo=webdriver.Chrome()
gpo.get("http://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG")


### Location, location, location

Locating elements in the website is central to using Selenium. Whether you want to extract information or simply navigate around, you need to locate the relevant element on the page so Selenium can capture, manipulate or click it.

Selenium offers <a href="https://selenium-python.readthedocs.io/locating-elements.html" target=_blank>several different ways</a> to find an element. Each way is built on one of the following element characteristics:
- the element's tag or attribute:
    - tag name (``find_element_by_tag_name``): Supply the HTML tag of your element.
    - class attribute value (``find_element_by_class_name``): Specify the value of the <i>class</i> attribute of your element.
    - ID attribute value (``find_element_by_id``): Specify the value of the <i>id</i> attribute of your element.
    - name attribute value (``find_element_by_name``): Specify the value of the <i>name</i> attribute of your element.
- the element's path: 
    - CSS selector (``find_element_by_css_selector``): Supply the CSS selector to locate your element.
    - XPath (``find_element_by_xpath``): Supply the XPath for your element.
- the element's displayed link text:
    - link text (``find_element_by_link_text``): Specify the entire text of the link you want to locate.
    - partial link text (``find_element_by_partial_link_text``): More flexible than the above since it also returns partial matches.

Note that each of these commands also exists in a plural form, e.g. "``find_element``<code><b>s</b></code>``_by_partial_link_text``". The difference between the singular and the plural is the number of elements the search returns. The singular form only returns the first element that matches your search terms. The plural form returns all matching elements as a list.
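The two forms also differ when nothing matches: the singular form raises a `NoSuchElementException`, while the plural form simply returns an empty list. The following is a minimal sketch of a helper that relies on this. The name `first_or_none` is hypothetical, and the sketch assumes the generic `find_elements(by, value)` method, which takes the locator strategy as a string. Later 4.x releases of Selenium dropped the `find_element_by_*` helpers in favor of this generic form, so it should work with both older and newer versions.

```python
def first_or_none(driver, by, value):
    """Return the first element matching the locator, or None if nothing matches.

    Hypothetical helper. `by` is a locator strategy string such as
    "partial link text" (the value of selenium's By.PARTIAL_LINK_TEXT).
    """
    # The plural search returns an empty list when nothing matches,
    # so no exception handling is needed here.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


# Usage with the notebook's driver (requires a running browser):
# link = first_or_none(gpo, "partial link text", "Congress (")
# if link is not None:
#     link.click()
```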

### Working through a list

Let's use the described search functions and apply one of them to the page in our browser. For starters, let's get Selenium to click on all the links to the congressional sessions in the navigation menu on the page.

Before Selenium can click on anything, we have to collect the location. The most convenient way to do that is probably the search based on partial link texts. Searching by "Congress" is the obvious candidate. 

But wait: there are more links on this page that include that word (e.g. "About the Congressional Hearings"). What makes the session links unique is the opening parenthesis that follows the word "Congress", i.e. "Congress (".

Further note that we want to collect more than one location and thus want to use the command in its plural.
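To see why the extra parenthesis matters, the matching can be imitated in plain Python. This is only a sketch: the session link texts below are made up for illustration (only "About the Congressional Hearings" is taken from the page), and the `in` test is a simplification of how the browser matches partial link text.

```python
# Hypothetical link texts; the real page may word them differently.
link_texts = [
    "115th Congress (2017-2018)",
    "114th Congress (2015-2016)",
    "About the Congressional Hearings",
]


def partial_matches(texts, fragment):
    """Return the texts that contain `fragment`, imitating a partial link text search."""
    return [text for text in texts if fragment in text]


print(partial_matches(link_texts, "Congress"))    # all three links match
print(partial_matches(link_texts, "Congress ("))  # only the two session links match
```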

In [84]:
congresses=gpo.find_elements_by_partial_link_text("Congress (")

So what did Selenium come back with? A list of 16 elements, which is the number of congressional sessions currently published on the GPO's page (the 101st session is missing for some reason). The contents of that list are the locations of these elements on the page.

Let's have a look:

In [85]:
print('Number of elements: ', len(congresses))
congresses

Number of elements:  16


[<selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-3")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-4")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-5")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124397534311-7")>,
 <selenium.webdriver.remote.webelement.WebElement (session="2ec715e6600d97aaeb6ced26e33c2a22", element="0.55124

A list with 16 elements containing the locations of our congressional session links.

The task at hand is to get Selenium to click on all the links to congressional sessions. Let's start small by clicking on the first link, the one relating to the 115th Congress.

In [86]:
congresses[0].click()

Nice!
Now the 114th!

In [87]:
congresses[1].click()

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
  (Session info: chrome=64.0.3282.167)
  (Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.16299 x86_64)


<i>Failure!</i> But why?!

Because the page changed. Clicking on the link to the 115th Congress changed the location of all other links on the page. Selenium can't find them anymore. Its map is no longer accurate. 

Going through a list therefore means searching many times. Every time the page is manipulated, one needs to search again for the elements with the partial link text "Congress (".

So let's do that and try again.

In [88]:
congresses=gpo.find_elements_by_partial_link_text("Congress (")
congresses[1].click()

Much better.
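This habit of searching again right before each click can be wrapped in a small function. The following is a sketch under assumptions: `click_nth` is a hypothetical helper, and it uses the generic `find_elements(by, value)` call with the locator strategy passed as a string.

```python
def click_nth(driver, by, value, index):
    """Search the page afresh and click the match at position `index`.

    Hypothetical helper: the search happens immediately before the click,
    so the element reference is as fresh as it can be.
    """
    matches = driver.find_elements(by, value)
    if index >= len(matches):
        raise IndexError(
            "wanted match #%d but only %d found" % (index, len(matches))
        )
    matches[index].click()


# Usage with the notebook's driver (requires a running browser):
# click_nth(gpo, "partial link text", "Congress (", 0)
# click_nth(gpo, "partial link text", "Congress (", 1)
```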

One could now write down this code 16 times, once for every congressional session. But that is not desirable, since lengthy code increases the risk of typos, among other problems.

The more elegant solution is to loop through the list of Congresses we received in response to our search query. True, clicking through the stored elements does not work, but we can still exploit the length of the list.

Recall that the returned list included 16 elements. One for every Congress currently displayed on the GPO's page. We can use that length to set up a counter that repeats our little exercise 16 times. That is, we can write a loop that repeats "search the element" + "click the one which the counter indicates" 16 times. 

Here is how it goes:

In [89]:
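# Search once up front, only to learn how many session links there are.
congresses=gpo.find_elements_by_partial_link_text("Congress (")

for counter in range(len(congresses)):
    # Search again on every pass: the previous click made the old references stale.
    congresses=gpo.find_elements_by_partial_link_text("Congress (")
    congresses[counter].click()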

Could you do it faster on your own?
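The same loop can be packaged as a reusable function. This is a sketch, not a definitive implementation: `click_each` is a hypothetical name, the locator strategy is passed as a string to the generic `find_elements(by, value)` method, and the guard for a shrinking list is an addition that the loop above does not have.

```python
def click_each(driver, by, value):
    """Click every match of a locator in turn, searching afresh before each click.

    Hypothetical helper. Returns the visible text of each clicked element,
    read before the click while the reference is still valid.
    """
    clicked = []
    total = len(driver.find_elements(by, value))  # initial search: only the count is used
    for position in range(total):
        matches = driver.find_elements(by, value)  # fresh references for this pass
        if position >= len(matches):
            break  # the page now shows fewer matches than at the start
        element = matches[position]
        clicked.append(element.text)
        element.click()
    return clicked


# Usage with the notebook's driver (requires a running browser):
# sessions = click_each(gpo, "partial link text", "Congress (")
```

Returning the link texts gives a simple record of what was actually clicked, which helps when checking that the loop visited every session.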

### Working through a list of lists

There is no need to stop here. To have our computer click through several lists, we can simply nest such loops inside one another. The logic is always the same: first identify the number of elements you need Selenium to click through, then set up the counter and loop the loop.

Our second target is to have Selenium click through all the congressional hearing types (House, Senate and Joint Hearings) once it has opened up a session. Plain sailing? Let's first try only the 115th Congress.

In [90]:
congresses=gpo.find_elements_by_partial_link_text("Congress (")
congresses[0].click()

hearing_type=gpo.find_elements_by_partial_link_text("Hearings")
hearing_type[0].click()

I am sure you saw it coming. Using a word as common as "Hearings" on this page is probably not your best strategy. In general, when clicking through lengthy navigation menus like the present one, it may be a better idea to use XPath or CSS selectors to locate the desired element. Unless you know what link texts are on the pages you are about to click through, using (partial) link text to search for elements is not without the risk of being thrown off course.

So here is the version with XPath.<br>
<i>(Please close any additional tab or window that may have opened in the meantime due to clicking on the wrong link.)</i>

In [91]:
congresses=gpo.find_elements_by_partial_link_text("Congress (")
congresses[1].click()

hearing_type=gpo.find_elements_by_xpath(".//div[@class='level2 browse-level']/a")
hearing_type[0].click()
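To see what this XPath selects compared with the link text search, it can be tried offline on a small stand-in for the menu. The fragment below is hypothetical: only the `level2 browse-level` class is taken from the notebook, and everything else (the other class, the link wording, the `href` values) is assumed. It uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath but enough for this expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the navigation menu.
menu = ET.fromstring("""
<div id="browse">
  <div class="level1 browse-level"><a href="#c115">115th Congress (2017-2018)</a></div>
  <div class="level2 browse-level"><a href="#house">House Hearings</a></div>
  <div class="level2 browse-level"><a href="#senate">Senate Hearings</a></div>
  <p><a href="#about">About the Congressional Hearings</a></p>
</div>
""")

# Searching by the word "Hearings" also catches the unrelated "About" link.
by_text = [a.text for a in menu.iter("a") if "Hearings" in a.text]

# The XPath only returns links that sit directly inside a level-2 menu entry.
by_xpath = [a.text for a in menu.findall(".//div[@class='level2 browse-level']/a")]

print(by_text)
print(by_xpath)
```

Note that the `@class='...'` test compares the whole attribute string, so it only matches elements whose class attribute is exactly `level2 browse-level`.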

Let's loop this.

To keep things a little quieter, let's use the back button to return to the original position of our navigation menu.

In [92]:
gpo.back()
gpo.back()


Off we go:

In [93]:
# Search once up front, only to learn how many session links there are.
congresses=gpo.find_elements_by_partial_link_text("Congress (")

for counter in range(len(congresses)):
    # Search again on every pass: earlier clicks made the old references stale.
    congresses=gpo.find_elements_by_partial_link_text("Congress (")
    congresses[counter].click()

    # Count the hearing types listed under the session that just opened.
    hearing_type=gpo.find_elements_by_xpath(".//div[@class='level2 browse-level']/a")

    for type_counter in range(len(hearing_type)):
        hearing_type=gpo.find_elements_by_xpath(".//div[@class='level2 browse-level']/a")
        hearing_type[type_counter].click()
        gpo.back()  # back to the list of hearing types

    gpo.back()  # back to the original navigation menu

KeyboardInterrupt: 
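The nested loop can also be wrapped in a function so that the two locators are easy to swap. This is a sketch that mirrors the logic of the loop above, including its two `back()` calls. The name `click_through_nested` and the `(strategy, value)` pairs are hypothetical conventions, and the locator strategies are passed as strings to the generic `find_elements(by, value)` method.

```python
def click_through_nested(driver, outer, inner):
    """Click every inner link under every outer link, going back after each click.

    Hypothetical helper. `outer` and `inner` are (strategy, value) pairs,
    e.g. ("partial link text", "Congress (").
    Returns the (outer text, inner text) pairs in the order they were clicked.
    """
    visited = []
    for i in range(len(driver.find_elements(*outer))):
        outer_link = driver.find_elements(*outer)[i]  # fresh search each pass
        outer_text = outer_link.text
        outer_link.click()

        for j in range(len(driver.find_elements(*inner))):
            inner_link = driver.find_elements(*inner)[j]  # fresh search each pass
            visited.append((outer_text, inner_link.text))
            inner_link.click()
            driver.back()  # return to the opened session menu

        driver.back()  # return to the starting menu, as in the notebook
    return visited


# Usage with the notebook's driver (requires a running browser):
# pairs = click_through_nested(
#     gpo,
#     ("partial link text", "Congress ("),
#     ("xpath", ".//div[@class='level2 browse-level']/a"),
# )
```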


Enjoy!

In [82]:
gpo.quit()