# XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. lxml (http://lxml.de) is one that has consistently strong performance in parsing very large files. lxml has multiple programmer interfaces; first I’ll show using  lxml.html for HTML, then parse some XML using lxml.objectify.

Many websites make data available in HTML tables for viewing in a browser, but not downloadable as an easily machine-readable format like JSON, HTML, or XML. I noticed that this was the case with Yahoo! Finance’s stock options data. If you aren’t familiar with this data; options are derivative contracts giving you the right to buy (call option) or sell (put option) a company’s stock at some particular price (the strike) between now and some fixed point in the future (the expiry). People trade both call and put options across many strikes and expiries; this data can all be found together in tables on Yahoo! Finance.

> To get started, find the URL you want to extract data from, open it with urllib2 and parse the stream with lxml like so:

In [2]:
from lxml.html import parse
from urllib.request import Request, urlopen

In [3]:
parsed = parse(urlopen('http://www.nytimes.com'))
doc = parsed.getroot()

Using this object, you can extract all HTML tags of a particular type, such as table tags containing the data of interest. As a simple motivating example, suppose you wanted to get a list of every URL linked to in the document; links are a tags in HTML. Using the document root’s findall method along with an XPath (a means of expressing “queries” on the document):

In [4]:
link = doc.findall('.//a')

link

[<Element a at 0x1d8e42b7bf0>,
 <Element a at 0x1d8e42b7470>,
 <Element a at 0x1d8e42b76a0>,
 <Element a at 0x1d8e42b7e20>,
 <Element a at 0x1d8e42b7ce0>,
 <Element a at 0x1d8e42b7c90>,
 <Element a at 0x1d8e42b7c40>,
 <Element a at 0x1d8e42b7dd0>,
 <Element a at 0x1d8e42b7d30>,
 <Element a at 0x1d8e42b7330>,
 <Element a at 0x1d8e42b70b0>,
 <Element a at 0x1d8e42b72e0>,
 <Element a at 0x1d8e42b7010>,
 <Element a at 0x1d8e42b6fc0>,
 <Element a at 0x1d8e42b6f70>,
 <Element a at 0x1d8e42b7290>,
 <Element a at 0x1d8e42b7060>,
 <Element a at 0x1d8e42b6200>,
 <Element a at 0x1d8e42b6610>,
 <Element a at 0x1d8e42b6570>,
 <Element a at 0x1d8e42b65c0>,
 <Element a at 0x1d8e42b6de0>,
 <Element a at 0x1d8e42b73d0>,
 <Element a at 0x1d8e42b7380>,
 <Element a at 0x1d8e42b7420>,
 <Element a at 0x1d8e42b6bb0>,
 <Element a at 0x1d8e42b6ac0>,
 <Element a at 0x1d8e42b6b60>,
 <Element a at 0x1d8e42b69d0>,
 <Element a at 0x1d8e42b68e0>,
 <Element a at 0x1d8e42b6930>,
 <Element a at 0x1d8e42b6980>,
 <Elemen

But these are objects representing HTML elements; to get the URL and link text you have to use each element’s get method (for the URL) and text_content method (for the display text):

In [5]:
lnk = link[15]

lnk

<Element a at 0x1d8e42b7290>

In [6]:
lnk.get('herf')

In [7]:
lnk.text_content()

'Politics'

In [8]:
for i in range(len(link)):
    v = link[i].text_content()

    print(v, " ")

Skip to content  
Skip to site index  
  
U.S.  
International  
Canada  
Español  
中文  
Log in  
  
Today’s Paper  
  
  
World  
U.S.  
Politics  
N.Y.  
Business  
Opinion  
Tech  
Science  
Health  
Sports  
Arts  
Books  
Style  
Food  
Travel  
Magazine  
T Magazine  
Real Estate  
Video  
World  
U.S.  
Politics  
N.Y.  
Business  
Opinion  
Tech  
Science  
Health  
Sports  
Arts  
Books  
Style  
Food  
Travel  
Magazine  
T Magazine  
Real Estate  
Video  
Jan. 6 Panel After 8 Hearings: Where Will the Evidence Lead?Analysis: Comprehensive and compelling, the hearings on the Capitol riot laid out a powerful account of Donald Trump’s efforts to overturn the 2020 election.But it’s unclear if that will be enough to achieve the panel’s legal and political goals.Haiyun Jiang/The New York Times  
Bannon Found Guilty of Contempt in Case Related to Capitol Riot InquirySteve Bannon is the first close aide to former President Trump to be convicted as a result of an investigation into th

hus, getting a list of all URLs in the document is a matter of writing this list comprehension:

In [9]:
urls = [lnk.get('herf') for lnk in doc.findall('.//a')]


urls

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,

Now, finding the right tables in the document can be a matter of trial and error; some websites make it easier by giving a table of interest an id attribute. I determined that these were the two tables containing the call data and put data, respectively:

In [10]:
tables = doc.findall('.//table')
calls = tables[0]
puts = tables[0]

Each table has a header row followed by each of the data rows:

In [11]:
rows = calls.findall('.//tr')
rows

[<Element tr at 0x1d8e431afc0>,
 <Element tr at 0x1d8e431b010>,
 <Element tr at 0x1d8e431b060>,
 <Element tr at 0x1d8e431b0b0>]

For the header as well as the data rows, we want to extract the text from each cell; in the case of the header these are th cells and td cells for the data:

In [12]:
def unpack(row, kind = 'td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]

In [14]:
unpack(rows[0], kind = 'th')

['United States\xa0›', 'United StatesAvg. on Jul. 22', '14-day change']

Now, it’s a matter of combining all of these steps together to convert this data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all of the columns to floating point format. You could do this by hand, but, luckily, pandas has a class TextParser that is used internally in the read_csv and other parsing functions to do the appropriate automatic type conversion:

In [15]:
from pandas.io.parsers import TextParser

In [16]:
def parse_options_data(table):
 rows = table.findall('.//tr')
 header = unpack(rows[0], kind='th')
 data = [unpack(r) for r in rows[1:]]
 return TextParser(data, names=header).get_chunk()

Finally, we invoke this parsing function on the lxml table objects and get DataFrame results:

In [17]:
call_data = parse_options_data(calls)
put_data = parse_options_data(puts)

In [18]:
call_data[:10]

Unnamed: 0,United States ›,United StatesAvg. on Jul. 22,14-day change
New cases,127569,+18%,
Hospitalized,42710,+17%,
New deaths,444,+38%,


# Parsing XML with lxml.objectify

XML (extensible markup language) is another common structured data format supporting hierarchical, nested data with metadata. The files that generate the book you are reading actually form a series of large XML documents.

Above, I showed the lxml library and its lxml.html interface. Here I show an alternate interface that’s convenient for XML data, lxml.objectify.

Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot

In [25]:
from lxml import objectify
from pandas import DataFrame

In [22]:
path = '../../CSV Files/O_Reilly/ch06/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR return a generator yielding each INDICATOR XML element. For each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few tags):

In [24]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
                'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root:
 el_data = {}
 for child in elt.getchildren():
    if child.tag in skip_fields:
        continue
    el_data[child.tag] = child.pyval
 data.append(el_data)

Lastly, convert this list of dicts into a DataFrame:

In [26]:
perf = DataFrame(data)

perf

Unnamed: 0,AGENCY_NAME,INDICATOR_NAME,DESCRIPTION,PERIOD_YEAR,PERIOD_MONTH,CATEGORY,FREQUENCY,INDICATOR_UNIT,YTD_TARGET,YTD_ACTUAL,MONTHLY_TARGET,MONTHLY_ACTUAL
0,Metro-North Railroad,Escalator Availability,Percent of the time that escalators are operat...,2011,12,Service Indicators,M,%,97.0,,97.0,


ML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag which is also valid XML:

In [28]:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'

root = objectify.parse(StringIO(tag)).getroot()

You can now access any of the fields (like href) in the tag or the link text:

In [29]:
root

<Element a at 0x1d8eba331c0>

In [30]:
root.get('herf')

In [32]:
root.text

'Google'