# Scraping a single web page

# Preliminaries

For our example, we will scrape a document from the Securities and Exchange Commission (SEC): 

https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm

## What is allowed?

Works created by the U.S. federal government are generally in the public domain, so scraping the SEC website raises no copyright issues. For other websites, look for a Terms of Use statement to determine whether your use is allowed.

Check the site's robots.txt file to find out which parts of the website may be scraped: https://www.sec.gov/robots.txt

According to that file, `/Archives/edgar/data` is allowed to be scraped.
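
The same check can be done in code. Here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `sampleRobotsTxt` are invented for illustration and are not a copy of the SEC's actual file; the `example.org` URLs are likewise placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration only.
# They are NOT the SEC's real rules.
sampleRobotsTxt = """\
User-agent: *
Allow: /Archives/edgar/data
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sampleRobotsTxt.splitlines())   # parse rules from a list of lines

agent = 'BaskaufScraper/0.1'
allowedUrl = 'https://www.example.org/Archives/edgar/data/34088/'
blockedUrl = 'https://www.example.org/private/report.htm'

print(parser.can_fetch(agent, allowedUrl))   # True
print(parser.can_fetch(agent, blockedUrl))   # False
```

To test against a live site instead, call `parser.set_url(...)` with the address of the robots.txt file and then `parser.read()` in place of `parser.parse(...)`.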

## Initial setup

If necessary, install the following libraries using pip, Conda, or your IDE's package manager:
- requests
- beautifulsoup4
- soupsieve
- html5lib

In [2]:
import requests   # best library to manage HTTP transactions
import csv # library to read/write/parse CSV files
from bs4 import BeautifulSoup # web-scraping library

## Define a function to write results to a CSV

In [3]:
def writeCsv(fileName, array):
    # newline='' lets the csv module manage line endings itself
    with open(fileName, 'w', newline='', encoding='utf-8') as fileObject:
        writerObject = csv.writer(fileObject)
        for row in array:
            writerObject.writerow(row)
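
To see what the function produces, here is a small round-trip sketch: it writes a list of lists to a temporary file and reads it back. The rows in `sampleRows` and the file name `example.csv` are made up for illustration, and the function is repeated (using a `with` block) so the example runs on its own.

```python
import csv
import os
import tempfile

def writeCsv(fileName, array):
    # newline='' lets the csv module control line endings itself
    with open(fileName, 'w', newline='', encoding='utf-8') as fileObject:
        csv.writer(fileObject).writerows(array)

# Hypothetical rows, invented for illustration; note the comma inside a value
sampleRows = [['Description', 'Type'], ['Annual report, form 10-K', '10-K']]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'example.csv')
    writeCsv(path, sampleRows)
    # Read the file back to confirm the rows survive the round trip
    with open(path, newline='', encoding='utf-8') as fileObject:
        readBack = list(csv.reader(fileObject))

print(readBack == sampleRows)   # True
```

The value containing a comma is quoted automatically by the `csv` writer, which is one reason to use the library rather than joining strings by hand.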

## Create a User-Agent string

The User-Agent HTTP request header identifies the client to the server.  In typical web use, the user agent is a web browser.  Here's an example user agent description:

```
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)
```

The description starts with the name and version of the software, followed by additional identifying information in parentheses.  Nefarious scrapers might try to impersonate a web browser by sending a browser's user agent string.  We don't want to do that.  If you don't set a user agent, the `requests` library sends its own default string (of the form `python-requests/` followed by a version number), which some web servers block automatically.

Here's the user agent we will use for our scraper:

```
BaskaufScraper/0.1 (steve.baskauf@vanderbilt.edu)
```

I've given the client a descriptive name that is likely to be unique and a very low version number indicating that it's in development.  I've included my email address so that if my script does something bad, the server admins can email me about it instead of just immediately blocking me.
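
As a quick illustration, the sketch below prints the default user agent that `requests` would send, then prepares (but does not send) a request with custom headers so we can inspect what the server would see. The target URL is used only to build the request object; nothing goes over the network.

```python
import requests

# Identifier that requests sends when no User-Agent header is supplied
defaultAgent = requests.utils.default_user_agent()
print(defaultAgent)   # begins with "python-requests/"

customHeaders = {
    'Accept': 'text/html',
    'User-Agent': 'BaskaufScraper/0.1 (steve.baskauf@vanderbilt.edu)'
}

# Build and prepare the request without sending it
prepared = requests.Request('GET', 'https://www.sec.gov/', headers=customHeaders).prepare()
print(prepared.headers['User-Agent'])
```

Substitute your own client name and contact address; the one shown here belongs to the author of this notebook.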

In [4]:
baseUrl = 'https://www.sec.gov'

url = 'https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm'
acceptMediaType = 'text/html'
userAgentHeader = 'BaskaufScraper/0.1 (steve.baskauf@vanderbilt.edu)'
requestHeaderDictionary = {
    'Accept' : acceptMediaType,
    'User-Agent': userAgentHeader
}

# Scrape the web page

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Beautiful Soup creates a custom object that prints out as HTML but has special attributes and methods that allow its tree structure to be traversed.

We will use the following attributes and methods:

- `.find_all(elementNameString)` method: creates a list of all descendant elements named `elementNameString`
- `.contents` attribute: a list of all child nodes (child elements, plus any text between them)
- `.name` attribute: the name of the element
- `.text` attribute: the text value contained in the element
- `.get(attributeName)` method: extract the value of the attribute named `attributeName`
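
Here is a minimal sketch of these attributes and methods applied to a tiny HTML fragment. The fragment is invented for illustration, and it is parsed with Python's built-in `html.parser` rather than `html5lib` so that the example has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical HTML fragment, invented for illustration only
sampleHtml = """
<table>
  <tr><th>Seq</th><th>Document</th></tr>
  <tr>
    <td>1</td>
    <td><a href="/docs/report.htm">report.htm</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sampleHtml, features='html.parser')

rows = soup.find_all('tr')        # all descendant <tr> elements
dataRow = rows[1]
cells = dataRow.find_all('td')    # all <td> elements inside the second row

# .contents holds every child node, including whitespace-only text nodes
childTags = [node for node in dataRow.contents if isinstance(node, Tag)]

print(len(rows))                                # 2
print(dataRow.name)                             # tr
print(cells[0].text)                            # 1
print(cells[1].a.get('href'))                   # /docs/report.htm
print(len(dataRow.contents), len(childTags))    # 5 2
```

The last line shows why code that walks `.contents` usually needs to check what kind of node it has: the line breaks between the cells count as child nodes too.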

## Retrieve the HTML

Retrieve the text of the web page using HTTP, then clean it up by parsing it into a Beautiful Soup object.

In [5]:
response = requests.get(url, headers=requestHeaderDictionary)
soupObject = BeautifulSoup(response.text, features='html5lib')
print(soupObject)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>EDGAR Filing Documents for 0000034088-16-000065</title>
<link href="/include/interactive.css" rel="stylesheet" type="text/css"/>
</head>
<body style="margin: 0">
<!-- SEC Web Analytics - For information please visit: https://www.sec.gov/privacy.htm#collectedinfo -->
<noscript><iframe height="0" src="//www.googletagmanager.com/ns.html?id=GTM-TD3BKV" style="display:none;visibility:hidden" width="0"></iframe></noscript>
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'//www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-TD3BKV');</script

## Extract the data from the HTML

There are two tables in the document.  `find_all('table')` returns both of them as a list; we take the first one by its index.

In [None]:
tableObject = soupObject.find_all('table')[0]
print(tableObject)

Extract the rows from the table.

In [None]:
rowObjectsList = tableObject.find_all('tr')
print(rowObjectsList)

Find the data rows, collect the desired cells, and save as a spreadsheet.

In [None]:
# Create a list of lists to output as a CSV
outputTable = []
for row in rowObjectsList:
    outputRow = []
    
    # determine if the row is a header or data row
    dataRow = False
    childElements = row.contents
    for element in childElements:
        if element.name == 'td':
            dataRow = True
    
    # if it's a data row, get the Description text, the URL of the Document, and the Type text
    if dataRow:
        cells = row.find_all('td')
        outputRow.append(cells[1].text)
        outputRow.append(baseUrl + cells[2].a.get('href'))
        outputRow.append(cells[3].text)
        print(cells[1].text)
        print(baseUrl + cells[2].a.get('href'))
        print(cells[3].text)
        print()
        outputTable.append(outputRow)

# Write the list of lists to a CSV spreadsheet
fileName = 'sec_table.csv'
writeCsv(fileName, outputTable)
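
The cell above builds each document URL by concatenating `baseUrl` and the link's `href`, which works because the links on this page start with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a link against the page it came from and also copes with other link styles. The second and third `href` values below are hypothetical.

```python
from urllib.parse import urljoin

pageUrl = 'https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/0000034088-16-000065-index.htm'

# Root-relative link, like the ones in the filing index table
rootRelative = urljoin(pageUrl, '/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm')
print(rootRelative)
# https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm

# Hypothetical link relative to the page's own folder
sameFolder = urljoin(pageUrl, 'exhibit.htm')
print(sameFolder)
# https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/exhibit.htm

# Hypothetical absolute link: returned unchanged
absolute = urljoin(pageUrl, 'https://www.example.org/other.htm')
print(absolute)
# https://www.example.org/other.htm
```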

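# Scraping the same data from multiple pages

The cell below searches every table row on the page for a cell whose text is `10-K` and prints the URL linked from that row's Document cell.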



In [13]:
rowObjects = soupObject.find_all('tr')
for row in rowObjects:
    cellObjects = row.find_all('td')
    for cell in cellObjects:
        if cell.text == '10-K':
            print(baseUrl + cellObjects[2].a.get('href'))


https://www.sec.gov/Archives/edgar/data/34088/000003408816000065/xom10k2015.htm
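
One possible way to extend this to many pages is to wrap the search in a function and call it once per page. The sketch below is an assumption about how that could look, not a tested scraper: the names `find10kUrls`, `scrapePages`, and `delaySeconds` are hypothetical, the one-second pause is an arbitrary polite default, whitespace is stripped before comparing cell text, and `html.parser` stands in for `html5lib`. Only the offline check at the bottom actually runs; `scrapePages` is defined but not called.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def find10kUrls(htmlText, pageUrl):
    """Return document URLs from table rows that contain a cell reading '10-K'."""
    soup = BeautifulSoup(htmlText, features='html.parser')
    found = []
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if any(cell.text.strip() == '10-K' for cell in cells):
            # The Document link is expected in the third cell, as in the cell above
            link = cells[2].a if len(cells) > 2 else None
            if link is not None and link.get('href'):
                found.append(urljoin(pageUrl, link.get('href')))
    return found

def scrapePages(pageUrls, headers, delaySeconds=1.0):
    """Fetch each index page in turn, pausing between requests."""
    results = {}
    for pageUrl in pageUrls:
        response = requests.get(pageUrl, headers=headers)
        response.raise_for_status()      # stop if the server reports an error
        results[pageUrl] = find10kUrls(response.text, pageUrl)
        time.sleep(delaySeconds)         # avoid hammering the server
    return results

# Offline check with a hypothetical page, invented for illustration
sampleHtml = (
    '<table>'
    '<tr><th>Seq</th><th>Description</th><th>Document</th><th>Type</th></tr>'
    '<tr><td>1</td><td>FORM 10-K</td>'
    '<td><a href="/Archives/example/form10k.htm">form10k.htm</a></td>'
    '<td>10-K</td></tr>'
    '</table>'
)
sampleResult = find10kUrls(sampleHtml, 'https://www.sec.gov/Archives/example/index.htm')
print(sampleResult)
# ['https://www.sec.gov/Archives/example/form10k.htm']
```

To use it for real, pass `scrapePages` a list of filing index URLs and the `requestHeaderDictionary` defined earlier.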
