
### 1.A Basic HTML
### 1.A.1 Tags

The pieces of HTML documents that carry the commands for the browser are referred to as "tags". Tags are separated from the text through angle brackets ("<",">"). Also, most commands come in pairs consisting of an opening and an end tag. For example, when the browser encounters the opening tag "``<b>``", the browser displays all text in bold type until it encounters the end tag "``</b>``".

The HTML sequence "``<b>This is bold.</b>``" thus yields:   
<center><b>This is bold.</b></center>


Because tags usually come in pairs, HTML documents have a nested structure when multiple tags are combined. For instance, if one wants to underline part of the above example in HTML one would write "``<b>This <u>is</u> bold.</b>``". In this example, the tag for underline (``<u>``) is nested inside the tag for bold and your browser displays:<br>
<center><b>This <u>is</u> bold.</b></center>
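
The nesting can also be made visible with a few lines of Python. The following is a minimal sketch that uses the `html.parser` module from Python's standard library; the class name `NestingPrinter` is made up for this illustration and is not part of the tools used later in this tutorial.

```python
from html.parser import HTMLParser


class NestingPrinter(HTMLParser):
    """Record every tag and piece of text together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new nest: record it, then go one level deeper
        self.outline.append((self.depth, f"<{tag}>"))
        self.depth += 1

    def handle_endtag(self, tag):
        # An end tag closes the nest: go back up one level, then record it
        self.depth -= 1
        self.outline.append((self.depth, f"</{tag}>"))

    def handle_data(self, data):
        # Text between tags belongs to the nest we are currently in
        if data.strip():
            self.outline.append((self.depth, data.strip()))


printer = NestingPrinter()
printer.feed("<b>This <u>is</u> bold.</b>")
printer.close()

for depth, piece in printer.outline:
    print("    " * depth + piece)
```

Each level of indentation in the printed outline corresponds to one nest, so the word "is" ends up two levels deep: inside ``<u>``, which is itself inside ``<b>``.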

Before looking at a sample HTML document in more detail, it is useful to recognize a few commonly used tags. You will encounter some of them again later on in our tutorial.


<table>
<thead>
<tr>
<th>Tag name</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>&lt;head&gt;</code></td>
<td>The head of a page includes general information such as its authors, language etc.</td>
</tr>
<tr>
<td><code>&lt;body&gt;</code></td>
<td>All the content you see in the browser is inside the body tags.</td>
</tr>
<tr>
<td><code>&lt;p&gt;</code></td>
<td>A paragraph of text.</td>
</tr>
<tr>
<td><code>&lt;br&gt;</code></td>
<td>A break between two lines of text.</td>
</tr>
<tr>
<td><code>&lt;b&gt;</code>, <code>&lt;i&gt;</code>, <code>&lt;u&gt;</code></td>
<td>Display text in bold, italics or underlined.</td>
</tr>
<tr>
<td><code>&lt;a href="gpo.html"&gt;</code></td>
<td>A link to the page gpo.html.</td>
</tr>
<tr>
<td><code>&lt;img src="gpo.jpg"&gt;</code></td>
<td>Displays the picture gpo.jpg.</td>
</tr>
<tr>
<td><code>&lt;table&gt;</code>, <code>&lt;tr&gt;</code>, <code>&lt;td&gt;</code></td>
<td>A table including its rows (tr) and cells (td).</td>
</tr>
<tr>
<td><code>&lt;div&gt;</code></td>
<td>Divides an HTML document into different parts. Usually used for layout purposes.</td>
</tr>
<tr>
<td><code>&lt;span&gt;</code></td>
<td>Similar to <code>&lt;div&gt;</code> tag, but only for small parts of the code.</td>
</tr>
</tbody></table>


### 1.A.2 Attributes

Some tags include additional information besides the command. See for example the ``<a>`` or the ``<img>`` tags in the table above. This additional information is referred to as an "attribute" of the tag. The hyperlink tag ``<a>`` always contains the URL of the destination page as an attribute. Similarly, the tag ``<img>`` includes the location of the image to display. 

Another common role of an attribute is to assign a specific ID to a tag. A specific ID helps the author locate and manipulate special sections of the code more quickly. For example, ID attributes are often used to assign special formatting to selected sections which all share a common ID for that purpose.
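
To see how a program perceives attributes, here is a small sketch using Python's built-in `html.parser` module. The class name `AttributeCollector` and the ID value "my-link" are invented for this illustration; the file names are the ones from the table above.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes of every opening tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found.append((tag, dict(attrs)))


collector = AttributeCollector()
collector.feed('<a href="gpo.html" id="my-link">GPO</a> <img src="gpo.jpg">')
collector.close()

for tag, attributes in collector.found:
    print(tag, attributes)
```

The link tag carries two attributes (``href`` and ``id``), the image tag only one (``src``).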

### 1.A.3 HTML document structure

Now that the basic components of HTML documents are introduced, let's look at how they come together in an HTML document. The example we are going to study is the following excerpt from the above GPO screenshot:

<table class="browse-node-table">

    <tr>
        <td colspan="2">
            <span>
                S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014
            </span>
        </td>
    </tr>

    <tr>
        <td>
            <span class="results-line2">
                Appropriation. Wednesday, April 24, 2013.
            </span>
        </td>

        <td>
            <a href="https://www.gpo.gov/fdsys/pkg/CHRG-113shrg39104553/pdf/CHRG-113shrg39104553.pdf" target="_blank">
                PDF
            </a>
            |
            <a href="https://www.gpo.gov/fdsys/pkg/CHRG-113shrg39104553/html/CHRG-113shrg39104553.htm" target="_blank">
                Text
            </a>
            | 
            <a href="https://www.gpo.gov/fdsys/search/pagedetails.action?collectionCode=CHRG&browsePath=113%2FSENATE%2FCommittee+on+Appropriations&granuleId=CHRG-113shrg39104553&packageId=CHRG-113shrg39104553&fromBrowse=true" target="_blank"> 
               More
            </a>

        </td>
    </tr>


    <tr>
        <td colspan="2">&nbsp;</td>
    </tr>

</table>


This excerpt contains the basic information for the first Hearing (as displayed above). It is organized as a table with three rows. The first row consists of only one cell which contains the Hearing title. The second row contains two cells: the left cell includes more information on the Hearing (e.g. the date) and the right cell includes links to the different formats of the Hearing transcript (e.g. the PDF). The third row again consists of only one cell and this cell is left blank.

Now remember that we want to give the computer simple directions such as "Please click on the second link in the right-hand cell of the above table's second row ("Text")." This would be enough for you to locate the link to this Hearing's transcript. Eventually, we are going to write very similar instructions for the program; just using HTML tags rather than the words "table", "cell" or "row".

However, to formulate these directions it does not suffice to give the name of the tag that contains the desired URL. Most HTML tags appear several times inside the same page. There will be an ``<a>`` for every link, a ``<p>`` for every new paragraph, and so on. Using just the tag would be as precise as finding the address of "John Smith" by looking up his name in New York's phone book. To provide precise information, we also need to understand the internal structure of an HTML document, much like the way we would use the street address or area to locate someone with a common name.

For HTML documents, the nested structure can be visualized using a horizontal hierarchy. In this visualization, tags from the same nest are displayed with the same horizontal indentation. For an intuitive introduction to HTML document structure, let’s revisit the Hearing excerpt above and look at the tags behind that table with three rows through the following visualization. Before looking at the code for the entire table, let's first understand the top row:
<br>

<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px;background-color:#ddd">&lt;tr id="this-is-the-1st-row"&gt;
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td colspan="2" id="this-is-the-only-cell"&gt;
    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014</div>
&lt;/td&gt;
</div>
&lt;/tr&gt;
</div>

Note that besides indentation, tags from the same nest are surrounded by the same dashed line. The excerpt starts with the opening tag for a table row (``<tr>``) and ends with the respective closing tag (``</tr>``).
The row includes one cell (``<td>`` → ``</td>``) and in that cell we find the text that we see in our browser.

There are two noteworthy things we can learn from the cell tags. First, the opening tag includes two attributes. The first attribute, ``colspan``, tells the browser how many columns this cell spans. The second attribute is one of the ID assignments alluded to above. The author could use this attribute to standardise the display of such cells throughout the entire page. For example, one could use this ID attribute to define that all cells with this ID should have a light blue background. When using ID attributes, the HTML author only has to define this once at the start of the HTML code and the browser will apply it whenever it encounters the ID inside the code.

The second noteworthy thing is a further bit of HTML jargon. We have already seen how "tag" and "attribute" are commonly used in the context of HTML. The next piece is the difference between "the value of an attribute" and "the value of a tag". The value of a tag is what is contained inside its nest, i.e. between the opening and the end tag. So in the above excerpt, the Hearing title ("S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014") is the value of the ``<td>`` tag. However, the value of the ID attribute of the ``<td>`` tag is "this-is-the-only-cell".

This piece of jargon will be useful when sending our computer out to collect information for us. Sometimes we are interested in the value of an attribute and sometimes in the value of a tag. Say we want to collect a link. In that case, the computer should collect the value of the ``href`` attribute of a specific ``<a>`` tag for us. In other applications, we may want to collect the displayed text itself. In that case, we will ask the computer to come back with the value of, say, the ``<td>`` tag.

Note here that the value of a tag is not restricted to include text only. It also includes the tags contained in its nest. So in the above example, the value of the ``<tr>`` tag is <br>
"``<td colspan="2" id="this-is-the-only-cell"> S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014 </td>``".
<br>
<br>
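
The difference between the two kinds of value can be tried out in Python. The sketch below uses the `xml.etree.ElementTree` module from the standard library as a stand-in parser; it is not the tool this tutorial uses later, and it only works here because the snippet happens to be well-formed.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring(
    '<tr id="this-is-the-1st-row">'
    '<td colspan="2" id="this-is-the-only-cell">'
    'S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014'
    '</td>'
    '</tr>'
)
cell = row.find('td')

# Value of an attribute: what follows the "=" inside the opening tag
attribute_value = cell.get('id')

# Value of the <td> tag: the text between <td ...> and </td>
tag_value = cell.text

# Value of the <tr> tag: everything in its nest, including the <td> tags
row_value = ''.join(ET.tostring(child, encoding='unicode') for child in row)

print(attribute_value)
print(tag_value)
print(row_value)
```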


Now that we have seen the first row of the above table, let's look at the code for the entire table. It reads:

<div style="width:99%;border:1px solid grey;padding:3px;margin:10px;background-color:#ddd">&lt;table class="browse-node-table"&gt;
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-1st-row"&gt;
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td colspan="2" id="this-is-the-only-cell"&gt;
    <div style="width:90%;border:0px dashed grey;padding:5px;margin:0px">S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014</div>
&lt;/td&gt;
</div>
&lt;/tr&gt;
</div>

<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-2nd-row"&gt;
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td id="this-is-one-cell"&gt;
    <div style="width:90%;border:1px dashed grey;padding:5px;margin:5px">&lt;span class="results-line2"&gt;    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">Appropriation. Wednesday, April 24, 2013.</div>
&lt;/span&gt;</div>
&lt;/td&gt;
</div>

<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td id="this-is-another-cell"&gt;
    <div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;a href="https://www.gpo.gov/link.to.PDF" target="_blank"&gt;    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">PDF</div>
&lt;/a&gt;</div>
    <div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;a href="https://www.gpo.gov/link.to.text" target="_blank"&gt;    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">Text</div>
&lt;/a&gt;</div>
    <div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;a href="https://www.gpo.gov/link.to.more" target="_blank"&gt;    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">More</div>
&lt;/a&gt;</div>
&lt;/td&gt;
</div>
&lt;/tr&gt;
</div>

<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-3rd-row"&gt;
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td colspan="2" id="this-is-the-only-cell"&gt;
    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">&nbsp;</div>
&lt;/td&gt;
</div>
&lt;/tr&gt;
</div>

&lt;/table&gt;
</div>

In this display, the original table's structure is clear to see. The entire excerpt is one large nest built from a ``<table>`` tag. In that nest, there are three row nests (``<tr>``) and each of the row nests includes at least one cell nest (``<td>``). Of those cells, the one containing the links takes up the most space with each link having its own nest (``<a>``).

There is one last HTML convention we need to be aware of before we can go on to write directions: How does the browser know how to arrange the layout of values within the same tag? For example, why is the cell containing the links to the right of the other cell in that row? Same for the order of the links. Why does the browser display "PDF | Text | More" and not "More | Text | PDF" or some other order? 

The reason is simple: As it goes through the code, our browser arranges everything it finds from left to right. To avoid a display in a single long line, some tags force the browser to start a new one. In that new one, the browser continues to go from left to right until it is interrupted again. Tags that cause such breaks in text are ``<br>`` or ``<p>``, but also design elements such as tables or images. 

This left-to-right convention will help us when writing the directions. What it does is allow us to pinpoint a tag that is otherwise identical to those in the same nest. In our example, we are only interested in the "Text" link from the table above. The URL we are looking for is in an otherwise indistinguishable ``<a>`` tag. In this case, we are still lucky because we could use the word "Text" to make our directions precise. 

However, we would be out of luck in a case like:<br>
<center>"Download different versions <a href="https://www.gpo.gov/fdsys/pkg/CHRG-113shrg39104553/pdf/CHRG-113shrg39104553.pdf" target="_blank">here</a>, <a href="https://www.gpo.gov/fdsys/pkg/CHRG-113shrg39104553/html/CHRG-113shrg39104553.htm" target="_blank">here</a> or <a href="https://www.gpo.gov/fdsys/search/pagedetails.action?collectionCode=CHRG&browsePath=113%2FSENATE%2FCommittee+on+Appropriations&granuleId=CHRG-113shrg39104553&packageId=CHRG-113shrg39104553&fromBrowse=true" target="_blank">here</a>." </center><br>
On occasions like this, we can exploit the "left-to-right" convention and simply ask the computer to extract the attribute value of the second ``<a>`` tag. More generally, it is often useful to give directions by the number of the tag, rather than the value it contains. Often, we do not know the value of the relevant tag or the value of that tag is precisely what we want to collect.
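
Selecting a tag by its position can be sketched in a few lines of Python. The example assumes the sentence is wrapped in a ``<p>`` tag and uses the placeholder URLs from the visualization above; `xml.etree.ElementTree` from the standard library serves as a stand-in parser.

```python
import xml.etree.ElementTree as ET

sentence = ET.fromstring(
    '<p>Download different versions '
    '<a href="https://www.gpo.gov/link.to.PDF">here</a>, '
    '<a href="https://www.gpo.gov/link.to.text">here</a> or '
    '<a href="https://www.gpo.gov/link.to.more">here</a>.</p>'
)

# All <a> tags in the order in which they appear in the code
links = sentence.findall('a')

# The displayed text is of no help: every link reads "here"
texts = [link.text for link in links]

# The position is: the second link sits at index 1, because Python counts from 0
second_url = links[1].get('href')

print(texts)
print(second_url)
```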

With the basics of HTML in mind, we can now turn to the problem of how to translate "Please click on the second link in the right-hand cell of the above table's second row ("Text")." into a command the computer understands. 



### 1.B XPath: The directory of HTML

While your computer may not understand verbal directions (for now), it understands XPath. XPath, the "XML Path Language", allows you to identify a particular tag, attribute or sets of them inside an HTML code. As the name implies, we will communicate our directions by giving the computer the path it should follow inside the HTML document in order to get to the location of what we want to collect. There are various ways to formulate our directions in XPath. Please consult <a href="https://www.w3schools.com/xml/xpath_intro.asp">this tutorial</a> for a detailed exposition of XPath and its many useful properties. For the purpose of this tutorial, we will start with the most specific version of using an XPath and then introduce two shortcuts.

#### 1.B.1 The complete path
The most specific way to ask the programme to "go to the second link in the right-hand cell of the above table's second row" is to literally spell out the entire path through the document from start to finish. 

The rules are simple:<br>
1. We only include the opening tags. The program will just ignore closing tags it sees inside the code along the way.<br>
2. The path only includes tags where the programme has to move into a new nest (or towards the right in the above visualization).<br>
3. We use numbers in square brackets to identify a specific tag in cases where there are multiple ones of the same kind within the same nest. XPath starts counting at 1.<br>
4. We connect the different pieces of our path with forward slashes ("/") between the tags.

Voilà! 

Let's apply this to our example. We spell it out alongside the XPath again to see where the different parts come from:
<table style="width:90%;border:0px;fill:0px;padding:5px;margin:0px">
    <tr style="width:30%;border:0px;fill:0px">
        <td style="width:30%;border:0px;fill:0px">
          <span style="margin-left:27px">``table``</span><br> 
          <span style="margin-left:82px">``tr[2]``</span><br>
          <span style="margin-left:130px">``td[2]``</span><br>
          <span style="margin-left:179px">``a[2]``</span><br>
          &#129154;  ``/table/tr[2]/td[2]/a[2]``
        </td>
        <td style="width:70%;border:0px;fill:0px">
           "Start at the beginning of our code (``table``),<br> 
use the second row tag (``tr[2]``).<br>
Within that row tag, go into the second cell tag you can find (``td[2]``), and <br>
stop in that cell's second link tag (``a[2]``)."<br>
<br>
        </td>
    </tr>
</table>

If you take another look at the visualized HTML code above, you can recognize the XPath logic:


<div style="width:99%;border:1px solid grey;padding:3px;margin:10px;background-color:#ddd">&lt;table class="browse-node-table"&gt;
<font color="#6e6e6e">
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-1st-row"&gt; ...
&lt;/tr&gt;
</div>
</font>
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-2nd-row"&gt;
<font color="#6e6e6e">
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td id="this-is-one-cell"&gt; ... &lt;/td&gt;
</div>
</font>
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;td id="this-is-another-cell"&gt;
<font color="#6e6e6e">
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;a ... &gt; PDF &lt;/a&gt;</div></font>
    <div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;a href="https://www.gpo.gov/link.to.text" target="_blank"&gt;    <div style="width:90%;border:0px dashed grey;padding:5px;margin:5px">Text</div>
&lt;/a&gt;</div>
<div style="width:90%;border:1px dashed grey;padding:5px;margin:10px"><font color="#6e6e6e">&lt;a ... &gt; More &lt;/a&gt;</font></div>
&lt;/td&gt;
</div>
&lt;/tr&gt;
</div>
<font color="#6e6e6e"><div style="width:90%;border:1px dashed grey;padding:5px;margin:10px">&lt;tr id="this-is-the-3rd-row"&gt; ...
&lt;/tr&gt;
</div></font>

&lt;/table&gt;
</div>
Writing the entire path is simple enough in short HTML code. However, in more convoluted code this may be no longer practical. There are two types of shortcuts to help you locate the relevant part of the code more efficiently.
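
The complete path can be tried out on a simplified copy of the table. This is a sketch under two assumptions: it uses `xml.etree.ElementTree` from Python's standard library, which supports only a subset of XPath and is not the package used later in this tutorial, and the blank cell in the third row is written as a plain space. Because ElementTree starts its search at the outermost tag (here ``<table>``), the path ``/table/tr[2]/td[2]/a[2]`` is written as ``./tr[2]/td[2]/a[2]``.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the table above, with the placeholder URLs of the visualization
TABLE = """
<table class="browse-node-table">
  <tr id="this-is-the-1st-row">
    <td colspan="2" id="this-is-the-only-cell">S. Hrg. 113 - DEPARTMENT OF DEFENSE APPROPRIATIONS FOR FISCAL YEAR 2014</td>
  </tr>
  <tr id="this-is-the-2nd-row">
    <td id="this-is-one-cell"><span class="results-line2">Appropriation. Wednesday, April 24, 2013.</span></td>
    <td id="this-is-another-cell">
      <a href="https://www.gpo.gov/link.to.PDF" target="_blank">PDF</a>
      <a href="https://www.gpo.gov/link.to.text" target="_blank">Text</a>
      <a href="https://www.gpo.gov/link.to.more" target="_blank">More</a>
    </td>
  </tr>
  <tr id="this-is-the-3rd-row">
    <td colspan="2" id="this-is-the-only-cell"> </td>
  </tr>
</table>
"""

table = ET.fromstring(TABLE.strip())

# Second row -> second cell -> second link
link = table.find('./tr[2]/td[2]/a[2]')

print(link.text)
print(link.get('href'))
```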

#### 1.B.2 Shortcuts

Landmarks are often useful when giving directions to a friend. Rather than spelling out the entire path from start to finish, you can use a known landmark near the destination and describe the rest of the way from there. XPath also allows for such shortcuts. 

##### Unique attributes

A unique attribute is the equivalent of a landmark in XPath. Recall that an attribute is the additional information inside an opening tag. Now observe that the cell containing our desired link is opened with a uniquely labelled ``<td>`` tag. Its ``id`` attribute has the value "this-is-another-cell". This value is unique throughout the code. Rather than spelling out the entire XPath from the top of our code, we can use this unique attribute as our landmark to start our directions from.

Inserting an attribute into an XPath works much the same way as using their order. We again use square brackets to describe a feature of the tag we are looking for. Rather than using its order number (e.g. ``[2]``), we specify the type and value of the relevant attribute. In our case, the attribute type is "``id``" and the value is "``this-is-another-cell``". As another XPath convention, we use the "@" symbol to alert our programme that what follows is an attribute of the tag we are looking for. Our XPath-landmark thus translates into "``[@id='this-is-another-cell']``". 

Using this changes our XPath from <br>
``/table/tr[2]/td[2]/a[2]`` to <br>
``.//td[@id='this-is-another-cell']/a[2]``

Looking at the differences between the two, notice that the start of our changed XPath also differs. The "``.//``" at the beginning of it tells the computer to allow all possible tag combinations before the unique ending we provide.
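
A short sketch shows that the landmark path and the complete path lead to the same tag. It again relies on `xml.etree.ElementTree` from the standard library as a stand-in (it supports attribute predicates such as ``[@id='...']``) and on a shortened copy of the table with placeholder URLs.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table: first and third row are left empty for brevity
TABLE = """
<table class="browse-node-table">
  <tr id="this-is-the-1st-row"><td colspan="2"> </td></tr>
  <tr id="this-is-the-2nd-row">
    <td id="this-is-one-cell">Appropriation. Wednesday, April 24, 2013.</td>
    <td id="this-is-another-cell">
      <a href="https://www.gpo.gov/link.to.PDF">PDF</a>
      <a href="https://www.gpo.gov/link.to.text">Text</a>
      <a href="https://www.gpo.gov/link.to.more">More</a>
    </td>
  </tr>
  <tr id="this-is-the-3rd-row"><td colspan="2"> </td></tr>
</table>
"""

table = ET.fromstring(TABLE.strip())

# Complete path, spelled out from the top of the code
by_full_path = table.find('./tr[2]/td[2]/a[2]')

# Shortcut: jump to the uniquely labelled cell, then take its second link
by_landmark = table.find(".//td[@id='this-is-another-cell']/a[2]")

print(by_landmark.get('href'))
print(by_landmark is by_full_path)
```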

##### Unique tails

The second possibility to shorten an XPath direction is to restrict it to those tags that make its ending unique. What an XPath query does is ask the computer to look through the code for all paths matching our directions. So if we know which parts of our directions make the desired location unique, that unique part is all the computer needs to know. 

In our example case, what makes the "Text" link unique is that it sits in the only "``<a>``" tag throughout the code that is the second link within its nest. Arguably, this is very rare, but for our special case, the relevant piece of the XPath boils down to:<br>
"``.//a[2]``".

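Again, we have used the "``.//``" to tell our computer to allow for all possible tags before our desired ``a[2]``. However, allowing so much flexibility also bears a risk. In cases where we have not found the unique tail, the query matches more than one location and the computer returns all of them rather than only the one we want.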
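
How a tail can stop being unique is easy to demonstrate. In this sketch, the two small tables and their ``href`` values ("text-1", "text-2" and so on) are invented for illustration, and `xml.etree.ElementTree` from the standard library serves as a stand-in XPath engine.

```python
import xml.etree.ElementTree as ET

# One cell with links: only one tag is the second <a> within its nest
one_row = ET.fromstring(
    '<table><tr><td>'
    '<a href="pdf-1">PDF</a><a href="text-1">Text</a><a href="more-1">More</a>'
    '</td></tr></table>'
)

# Two cells with links: each cell has its own second <a>
two_rows = ET.fromstring(
    '<table>'
    '<tr><td><a href="pdf-1">PDF</a><a href="text-1">Text</a></td></tr>'
    '<tr><td><a href="pdf-2">PDF</a><a href="text-2">Text</a></td></tr>'
    '</table>'
)

unique = [a.get('href') for a in one_row.findall('.//a[2]')]
ambiguous = [a.get('href') for a in two_rows.findall('.//a[2]')]

print(unique)
print(ambiguous)
```

The same query returns one location in the first table and two in the second, so it is worth checking how many matches a shortened path produces before relying on it.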



#### 1.B.3 Tool: Browser plugins help you find the XPath

The logic behind XPath is simple. However, in how many cases do we know the HTML code of a website whose text we want to collect? The good news: you don't have to sift through HTML documents yourself in order to tease out the XPath you want. Luckily, there are various browser plugins that help you extract the correct path with a few simple clicks.

One option is the "Firebug" add-on for Mozilla Firefox. Look at this video demonstration to see how it works:
[![Extracting an XPath (3 minute video)](pics/firebug video.jpg)](http://youtu.be/XsysldVfAmk?hd=1 "Extracting an XPath (3 minute video)")


Here is an incomplete list of these very useful helpers:
* Mozilla Firefox: The <a href="https://www.mozilla.org/en-US/firefox/developer/" target="_blank">Developer Kit</a> contains all you need.
* Google Chrome: the <a href="https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl"
target="_blank">XPath helper</a> extension
* <a href="https://stackoverflow.com/questions/34456722/getting-xpath-for-element-in-safari" target="_blank">Tutorial for Apple Safari</a>

#### Exercise: Find and compare XPaths

For this exercise, please familiarise yourself with how to extract an XPath using your browser. Once your browser is set up, please direct it to the <a href="https://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG" target="_blank">GPO website</a>.

Please extract the XPath to one "Text"-link of your choice.
How would you have to change the extracted path in order to direct your programme to the "PDF"-link?

### 1.C Another way to navigate inside your file: CSS selectors

Cascading Style Sheets (CSS) are a popular way to lay out HTML documents. Using CSS, the programmer has the option to specify the layout of similar HTML tags in a single command. For example, the programmer could specify that all links shall be displayed in red and a bold font with italic style. Rather than adding this information to each link tag, CSS allows them to set this feature once and see it applied whenever the link tag appears, e.g.:

`` a {``<br>
`` font-family: Arial, sans-serif;``<br>
`` font-weight: bold;``<br>
`` font-style: italic;``<br>
`` color: red;``<br>
`` }``<br>



To identify the tag of interest, the programmer uses a so-called "CSS selector". Besides applying formatting, CSS selectors can be used to navigate inside an HTML document.

Since the logic of these selectors mirrors the XPath language closely, the interested reader is referred to <a href="https://www.w3schools.com/cssref/css_selectors.asp">this tutorial</a> for more details on CSS selectors. Thankfully, the Firefox Developer Edition is able to provide the CSS selectors as described above.
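
To give a first impression of how selectors navigate a document, the sketch below applies three of them to the second row of our table. It assumes the BeautifulSoup package (`bs4`), which section 2 relies on, is installed; the URLs are the placeholders from the visualization above.

```python
from bs4 import BeautifulSoup

html = """
<table class="browse-node-table">
  <tr id="this-is-the-2nd-row">
    <td id="this-is-one-cell">
      <span class="results-line2">Appropriation. Wednesday, April 24, 2013.</span>
    </td>
    <td id="this-is-another-cell">
      <a href="https://www.gpo.gov/link.to.PDF">PDF</a>
      <a href="https://www.gpo.gov/link.to.text">Text</a>
      <a href="https://www.gpo.gov/link.to.more">More</a>
    </td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# A tag name selects every tag of that kind
all_links = soup.select('a')

# "tag.class" selects tags by the value of their class attribute
details = soup.select('span.results-line2')

# "#id" selects by ID; "> a:nth-of-type(2)" is the second <a> directly inside it
text_link = soup.select('#this-is-another-cell > a:nth-of-type(2)')

print(len(all_links))
print(details[0].get_text())
print(text_link[0]['href'])
```

The last selector plays the same role as the XPath ``.//td[@id='this-is-another-cell']/a[2]``.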



## 2 Scrape basics

In scraping, we ultimately try to collect the content of a tag inside an HTML document. After we have received it, we process it further. The content could be the link to a file, which we then download or import into our software. In many applications, we want to collect the text, e.g. from a government announcement online.

The hoops we go through are always the same, regardless of what we are trying to collect.
1. Load the website containing the desired data
2. Store a copy of the underlying HTML file in your computer's memory
3. Parse the HTML file so you can query it
4. Apply your query to the parsed copy
5. Store your returned value (or process it further)

Let's go through these in turn.
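
Before we do, here is a compact preview of the whole routine. It is only a sketch: to keep it runnable without an internet connection, a hard-coded string stands in for the downloaded page (steps 1 and 2), the sentence in its paragraph is made up, and the file name "scrape_sketch.txt" in the temporary folder is an arbitrary choice. It assumes the BeautifulSoup package (`bs4`) is installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Steps 1 and 2: a hard-coded string stands in for the downloaded HTML file
page_source = (
    '<html><body>'
    '<h3>Congressional Hearings</h3>'
    '<p>Browse by Congress.</p>'
    '</body></html>'
)

# Step 3: parse the HTML so that it can be queried
parsed = BeautifulSoup(page_source, 'html.parser')

# Step 4: apply the query, here the text of the first <h3> headline
headline = parsed.find('h3').get_text()

# Step 5: store the returned value in a text file
out_path = os.path.join(tempfile.gettempdir(), 'scrape_sketch.txt')
with open(out_path, 'w', encoding='utf-8') as handle:
    handle.write(headline)

print(headline)
```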


### 2.1 Load the website & store a copy of the underlying HTML file

In [None]:
import requests

gpo = requests.get('http://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG')


This is what we got:

In [None]:
gpo.content

Does not look like an HTML file, right? At this point, Python does not know it is looking at an HTML document. All it sees is a large chunk of text.

To be able to run XPath queries or otherwise process our file, we first have to:
### 2.2 Parse the HTML
This is where BeautifulSoup comes in (although there are other parsers). It has various parsers to create its "soup".

The result looks much more recognizable.

In [None]:
from bs4 import BeautifulSoup

gpo_parsed = BeautifulSoup(gpo.content, 'html.parser')

print(gpo_parsed.prettify()[6967:8374])

### 2.3 Apply your query

Now that Python knows it is dealing with an HTML file, we can write a query to extract the information we want. 

BeautifulSoup offers two ways to collect the information contained in an HTML node. One can either search the HTML code for specific tags and retrieve the information contained in them, or one can use a CSS selector to pinpoint the position of a particular tag of interest. Not to worry, we will return to XPath when applying another package further below.


#### Using 'find' or 'find_all'

Using the search function comes with the option to either find all or only the first of the nodes matching the search key. Here is the code to retrieve all links from the GPO's page.

In [None]:
gpo_parsed.find_all("a")

And this only yields the first, as would combining 'find_all' with an index in square brackets.

In [None]:
gpo_parsed.find("a")


In [None]:
gpo_parsed.find_all("a")[0]
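
Since the live page may change or be unavailable, the behaviour of the two commands can also be checked offline. This is a minimal sketch on a hand-written cell with the placeholder URLs used earlier; it additionally shows what the commands return when nothing matches.

```python
from bs4 import BeautifulSoup

cell = BeautifulSoup(
    '<td>'
    '<a href="https://www.gpo.gov/link.to.PDF">PDF</a> | '
    '<a href="https://www.gpo.gov/link.to.text">Text</a> | '
    '<a href="https://www.gpo.gov/link.to.more">More</a>'
    '</td>',
    'html.parser',
)

all_links = cell.find_all('a')   # a list with every matching node
first_link = cell.find('a')      # only the first matching node

print(len(all_links))
print(first_link == all_links[0])

# When nothing matches, find_all returns an empty list and find returns None
no_headlines = cell.find_all('h3')
no_headline = cell.find('h3')
print(no_headlines, no_headline)
```

The empty results matter in practice: indexing an empty list with ``[0]`` raises an error, so it is worth checking the length of the list first.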

#### Using CSS selectors

An alternative is using the CSS selectors. You can use the Firefox Developer Edition to extract the CSS selector for the position of interest. For example, this yields all paragraphs of the main text on the page:

In [None]:
gpo_parsed.select('#browse-layout-mask p')

#### Reducing it to the text
The find_all command extracts all nodes that fit the search terms. For instance, searching the page for "h3" finds only one title, i.e. "Congressional Hearings". Regardless of how many items it finds, BeautifulSoup stores the returned values in a list. If it finds one item that fits the search query, the list contains 1 item. If it finds 3, the list contains 3. 

Note that the result of such an "h3" query contains not only the value of the node ("Congressional Hearings"), but also the tags and their attributes. To restrict the returned value to the text only, BeautifulSoup provides the useful ".get_text()" command. However, the command can only be applied to individual items of a list, not the entire list at once. Thus we have to identify the number of the list element from which we want to extract only text before applying the command. In Python, one can select individual list items by stating their position in square brackets after the list. Python starts counting at 0, so the first item is ``[0]``.

Extracting the text of the headline thus translates into the command:

In [None]:
gpo_parsed.find_all("h3")[0].get_text()

Besides removing the starting and closing tags, adding ".get_text()" will also remove tags inside the text you have extracted. For instance, the third paragraph in the text displayed at the start of this section contains a link (signalled by the ``<a href=...>`` tag). Applying ".get_text()" likewise removes the URL from the returned value.

To see this, use the following code to strip all tags (``<a>`` or otherwise) from that paragraph of our GPO page.

In [None]:
gpo_parsed.select('#browse-layout-mask p')[2].get_text()
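
The same effect can be reproduced on a small, self-made paragraph. In this sketch the sentence and its link text are invented, and the URL is the placeholder used earlier; it also shows how to keep the URL if it is needed after all.

```python
from bs4 import BeautifulSoup

paragraph_html = (
    '<p>Read the <a href="https://www.gpo.gov/link.to.text">full transcript</a>'
    ' online.</p>'
)
soup = BeautifulSoup(paragraph_html, 'html.parser')
paragraph = soup.find_all('p')[0]

# get_text() drops the <p> tags as well as the <a> tag nested inside the text
text_only = paragraph.get_text()

# The URL is an attribute value, so it has to be collected separately
url = paragraph.find('a')['href']

# get_text() works on single items, so a list is processed item by item
link_texts = [link.get_text() for link in soup.find_all('a')]

print(text_only)
print(url)
print(link_texts)
```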

### 2.4 Store the returned value

Since we have now processed the collected text, we conclude with storing it on our local hard drive for future applications. In this example, we write a new text file containing the headline of the GPO's page.

In [None]:
file_name='my first file.txt'

text=gpo_parsed.find_all("h3")[0].get_text()

# The with-block closes the file automatically, even if writing fails
with open(file_name, 'w', encoding='utf-8') as file:
    file.write(text)

### 2.5 Exercise: Go through all steps yourself

To practice your scraping skill, try to collect the transcript of an individual hearing.

To do this, we have to change the URL somewhat.

In [None]:
exercise=requests.get('https://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG&browsePath=115%2FSENATE%2FCommittee+on+Foreign+Relations&isCollapsed=false&leafLevelBrowse=false&isDocumentResults=true&ycord=325.5')

Now write the code to collect the text (not the file) for the Hearing "S. Hrg. 115-4 - Nomination of Rex Tillerson to Be Secretary of State".

Print the first 1000 characters, just to be sure. Then, store your hearing into a local text file.

## 3 Adding complex pages: Let your computer browse the web

Now that we know all about giving directions, it is time to let your computer browse the web. This section introduces the Selenium package. Originally designed to test new websites, the Selenium package allows you to control your web browser without moving your mouse or touching your keyboard. Instead, the program navigates the site based on the XPath directions introduced above.

The introduction to Selenium remains very close to the goals of this tutorial. Please see the package's <a href="https://selenium-python.readthedocs.io/" target="_blank">documentation</a> for more information. Furthermore, specific tutorials can be found easily online as the Selenium package is very popular among Python users. 

#### Browser setup

Before letting it navigate the web, we have to award additional abilities to our program. When we want our smartphone to learn a new trick, we install a new app. In Python, our programming language, the equivalent of installing an app is loading a module. 

For what we want to do in this tutorial, we will have to load four modules. Our program uses the Operating System module ("os") which allows the program to work on your hard drive, the urllib module ("urllib") to collect text from a webpage, and the Selenium module ("selenium") to mimic the user's click routine on the page. Finally, we use a module called "time" to allow our program to take short breaks when we want it to wait before browsing on.

In [None]:
import os
import urllib
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

#### Starting the browser

With these modules in memory, our program is now ready to start browsing. The first step is to open a browser. In our example, we will browse the web using Mozilla Firefox. However, Selenium works on most popular browsers.

Our computer does not have its own mouse or keyboard with which it could open a new browser window. Instead, what Selenium uses is a so-called "webdriver". A webdriver basically opens the channel between our program and the browser installed on your computer.

To ask the webdriver to open up Firefox, all we need is:

In [None]:
webdriver.Firefox()


To open a Chrome window, the command is almost identical:

In [None]:
webdriver.Chrome()

Please close both browser windows manually again. 

The promise of this tutorial was that the computer would browse the web without your interaction. Yet, you were just asked to close two browser windows. Not to worry, the program can also close browser windows (or open new tabs and the like). However, for it to be able to do that, we have to give the browser window a name. Once it has a name, we can tell the program which browser window to work on. 

To see this, please run the following code. A new Firefox window will open. We call this window "my_fox". The program will then wait for 5 seconds before it closes the "my_fox" window again.

In [None]:
my_fox=webdriver.Firefox()
sleep(5)
my_fox.close()
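
The three steps above (open, wait, close) can also be wrapped so that the window is closed even if something goes wrong in between. The following is only a minimal sketch: `show_window` is a hypothetical helper made up for this illustration and is not part of Selenium.

```python
from time import sleep


def show_window(open_browser, seconds=5):
    """Open a browser window, keep it open for `seconds`, then close it.

    `open_browser` is any function that opens a window and returns it,
    for example webdriver.Firefox (hypothetical helper, for illustration).
    """
    window = open_browser()   # give the new window a name
    try:
        sleep(seconds)        # leave the window open for a moment
    finally:
        window.close()        # runs even if an error occurred while waiting
    return window
```

With the modules from above loaded, `show_window(webdriver.Firefox)` should behave like the three-line cell: the window appears, stays for five seconds and disappears again.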

#### Basic navigation

So far, so spooky. Now let's have the program go to our GPO website.

First, we open a new browser window again:

In [None]:
gpo=webdriver.Firefox()

Now, we tell the program to load the GPO's page in that browser window:

In [None]:
gpo.get("http://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG")

<b>Your turn:</b><br>Let the browser now move to a website of your choice:

In [None]:
gpo.get("enter your link here")

Getting to a page is nice, but we also want to interact with that page. Recall that the browser sees this page as a series of tags. Selenium refers to these tags as "elements" on the page. What we have to do then is direct the program at the page element we want to interact with. This is where the XPath language we learnt above comes in. But before we get back to that, let's look at a more intuitive way to interact.

In Selenium, it's not strictly necessary to know each and every element by its XPath. While XPath is the most precise way to get where we want, there are less general, but more user-friendly ways as well. For instance, the Selenium package includes a function that searches the webpage for elements containing a certain piece of text.

Say we want to start browsing the Hearings of the 115th Congress. First, let's return to that page again:

In [None]:
gpo.get("http://www.gpo.gov/fdsys/browse/collection.action?collectionCode=CHRG")

To locate the link we want to click on, we have to supply a unique piece of text to the program. Let's use "115th Congress".

We also want to store the location of this element so we can use it in later commands. The logic is the same as with naming our browser window "my_fox" or "gpo". Once we assign a name to the element, we can ask the program to use it in later applications. For this example, we call the location "congress_115".

In [None]:
congress_115=gpo.find_element_by_partial_link_text("115th Congress")

Now that the program knows the element, why not have it click on it?

In [None]:
congress_115.click()

And let's go look at the Hearings.

In [None]:
hearing=gpo.find_element_by_partial_link_text("House Hearings")
hearing.click()

Oh, actually we meant the Senate Hearings; specifically those of the Appropriations Committee.

In [None]:
gpo.back()
sleep(.5)

hearing=gpo.find_element_by_partial_link_text("Senate Hearings")
hearing.click()
sleep(.5)

committee=gpo.find_element_by_partial_link_text("Appropriations")
committee.click()
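
The half-second pauses above give the page time to load. On a slow connection, a fixed pause may turn out to be too short. One possible alternative is to keep repeating a search until it works or a time limit is reached. This is only a minimal sketch: `retry_until_found` is a hypothetical helper written for this illustration, not a Selenium function.

```python
from time import monotonic, sleep


def retry_until_found(find, timeout=10.0, pause=0.5):
    """Call `find()` again and again until it succeeds or `timeout` seconds pass.

    `find` is any function without arguments that either returns a result
    or raises an error (hypothetical helper, for illustration only).
    """
    deadline = monotonic() + timeout
    while True:
        try:
            return find()             # success: hand back what was found
        except Exception:
            if monotonic() >= deadline:
                raise                 # time is up: pass the last error on
            sleep(pause)              # otherwise wait briefly and try again
```

It could be used as `hearing = retry_until_found(lambda: gpo.find_element_by_partial_link_text("Senate Hearings"))`. Selenium also ships its own waiting tool, `WebDriverWait` in `selenium.webdriver.support.ui`, which is worth a look for larger projects.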

And now let's have a look at one of those texts. We will use the first one from the top. 

In [None]:
text= gpo.find_elements_by_partial_link_text("Text")
text[0].click()

You can see how this is coming along nicely. But before we go on, please notice two things about the code we just executed. First, there is an extra "s" in the command "``find_element``<b>s</b>``_by_partial_link_text``". It was added to the command because the text we are looking for fits more than one element on the page. With the extra "s", the command returns a list of all matching elements; without it, we would only get the first match.
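
Because the command with the extra "s" hands back a list, that list can also be empty when nothing on the page matches. The sketch below shows one way to deal with that. It assumes a hypothetical helper called `first_link_with_text`, and it uses the general command `find_elements(by, value)`, which recent Selenium 4 releases expect in place of the `find_elements_by_...` shortcuts.

```python
def first_link_with_text(driver, text):
    """Return the first link whose visible text contains `text`.

    Hypothetical helper for illustration; `driver` is a browser window
    such as the `gpo` window opened above.
    """
    # "partial link text" is the locator string behind Selenium's By.PARTIAL_LINK_TEXT
    matches = driver.find_elements("partial link text", text)
    if not matches:
        # An empty list means that no link on the page contains the text
        raise LookupError(f"no link containing {text!r} on this page")
    return matches[0]
```

Calling `first_link_with_text(gpo, "Text").click()` would then open the first transcript, or stop with a readable error message if the page contains no such link.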

The second point to note is the zero in the square brackets. This is another Python convention, and we see it for the first time in this tutorial. This somewhat unintuitive convention says that lists in Python are numbered starting from zero. So the first list item is item zero and the tenth item is item nine. In order to click on the first element, we thus have to write "[0]".
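
Counting from zero can be tried out without a browser. Here is a small illustration with a made-up list that stands in for the links Selenium returns:

```python
# A made-up list standing in for the elements that Selenium finds
links = ["Text 1", "Text 2", "Text 3"]

print(links[0])    # first item: prints "Text 1"
print(links[2])    # third item: prints "Text 3"
print(len(links))  # number of items: prints 3
```

The last valid position is always one less than the number of items, so `links[3]` would raise an error here.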

Back to our text. As you can see in your browser, the GPO provides transcripts of the Hearings as HTML documents including basic, unformatted text. This is good news for us since we do not have to worry about finding the text in a particular location on the page. Rather, we can download the whole page and store it into a text file. In the end, we will do precisely that: ask our program to download and store the many HTML files that it finds behind all the "Text" links.

There is one inconvenience between us and the conclusion of this tutorial right now: the Hearing text was opened in a new browser window. Before we can start storing individual text files, we need to get back to our original window.

We can do this by letting the program use the keyboard shortcuts of our browser. If you want to close a tab, you have two options: You can click on the "x" in its right-hand corner. Or you can press "CTRL + w" on your keyboard ("COMMAND + w" for Macs). Our program is able to do just that as well.

In the code below, we first ensure that the browser tab is active. We do this by having the program locate a tag that is inside the HTML we are looking at. Probably the most general tag in the HTML world is the "body" tag. We will have our program search for this one. Once we are sure that the program is looking at the browser tab we want to close, we send the key combination "CTRL + w" right to it.

In [None]:
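# Locate the page's <body> tag and press CTRL + w (on a Mac, use Keys.COMMAND instead of Keys.CONTROL)
gpo.find_element_by_tag_name('body').send_keys(Keys.CONTROL + 'w')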

Welcome back!<br>
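
Keyboard shortcuts sent through Selenium may not work with every browser and driver version. A possible alternative relies on the "window handles" that Selenium keeps for every open window or tab. The following is a minimal sketch: `close_other_windows` is a hypothetical helper written for this illustration, while `window_handles`, `switch_to.window()` and `close()` are regular Selenium webdriver features.

```python
def close_other_windows(driver, keep_handle):
    """Close every window or tab except `keep_handle`, then return to it.

    Hypothetical helper for illustration; `driver` is a browser window
    such as `gpo`, and `keep_handle` is the handle of the tab to keep.
    """
    for handle in list(driver.window_handles):   # copy: the list shrinks as tabs close
        if handle != keep_handle:
            driver.switch_to.window(handle)      # focus the extra tab
            driver.close()                       # close only the focused tab
    driver.switch_to.window(keep_handle)         # go back to the original tab
```

To use it, one would remember the original tab with `main = gpo.current_window_handle` before clicking on "Text", and call `close_other_windows(gpo, main)` afterwards.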

#### Storing the text

Rather than opening the link under "Text", we want to store the HTML file it leads to on our computer. Let's first take a look at this link. Remember from our basic HTML above that the link is found in the "href" attribute of the ``<a>`` tag. To extract this link, we write:

In [None]:
hearing_text_link = text[0].get_attribute('href')
print(hearing_text_link)

Notice that the extracted link ends with ".htm", a file extension for HTML files (another one is ".html"). What we have in front of us is thus the path to an HTML file of the Hearing transcript. What is left to do is to use this link to ask our program to download the ".htm" file and store it as a text file (".txt"). Converting the HTML file into a text file is very easy thanks to the absence of page formatting in the GPO's transcripts. All we have to do is save the file with the ending ".txt" instead of ".htm".

The urllib module we loaded at the start of this tutorial will do exactly that. All that we have left to do then is to set the file name which we want to use.

In [None]:
file_name="my second file.txt"
urllib.request.urlretrieve(hearing_text_link, file_name)
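
When many transcripts are downloaded, typing every file name by hand becomes tedious. The sketch below shows one possible way to derive a ".txt" file name from the link itself. The helper `txt_name_from_link` and the example link are made up for this illustration and do not come from the GPO website.

```python
import os
from urllib.parse import urlparse


def txt_name_from_link(link):
    """Turn a link ending in .htm or .html into a local .txt file name.

    Hypothetical helper for illustration only.
    """
    page = os.path.basename(urlparse(link).path)   # last part of the link, e.g. "CHRG-example.htm"
    stem, _extension = os.path.splitext(page)      # split off ".htm" or ".html"
    return stem + ".txt"


# A made-up link, used only to show the result
print(txt_name_from_link("http://www.example.org/hearings/CHRG-example.htm"))
# prints "CHRG-example.txt"
```

The result could then be passed on directly, as in `urllib.request.urlretrieve(hearing_text_link, txt_name_from_link(hearing_text_link))`.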

That's it! Our program just learned how to download its first Hearing transcript! The program will have to do it many times over in order to store all the Hearings that we can find online. But before we get to do that, let's first close the browser window again:

In [None]:
gpo.quit()