# HTML vs XML

- Both HTML and XML are markup languages
  - HyperText Markup Language
    - Hypertext: text with links (hyperlinks) to other text
  - eXtensible Markup Language


Example from [W3Schools](https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document):

<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>


- HTML has a fixed set of tags, so it is not extensible. XML is a framework for defining markup languages, whereas HTML is one particular markup language.
- HTML focuses on displaying data rather than describing it, which makes it semantically less interesting.
- HTML tolerates small syntactic errors, whereas XML is very unforgiving.
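
The last point can be checked with the standard library alone. The sketch below feeds the same made-up snippet (an assumption, not taken from any real page) to a strict XML parser and to Python's lenient `html.parser`; the names `parses_as_xml` and `TagCollector` are illustrative helpers.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: <br> is never closed, which HTML tolerates but XML rejects.
snippet = "<p>First line<br>second line</p>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(snippet)
collector.close()

xml_ok = parses_as_xml(snippet)
print("Accepted as XML:", xml_ok)
print("Tags seen by the HTML parser:", collector.tags)
```

Writing the line break as `<br/>` makes the same snippet acceptable to the XML parser.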


Most importantly, if you know XML, you are ready to work with HTML.

Today: practical advice on downloading, parsing, and navigating HTML documents from the web.

# Accessing content with requests

In [None]:
import requests

In [None]:
url = 'https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/19033-h.htm'

In [None]:
response = requests.get(url)

## Inspecting HTML

In [None]:
# the response code indicates whether the HTTP request completed successfully
response

A list of response codes is available [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
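
For a quick lookup without leaving the notebook, the standard library's `http.HTTPStatus` knows the standard codes. A minimal sketch, where `describe_status` is a hypothetical helper and "success" is taken to mean the 2xx range:

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description for a numeric HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not one of the codes the standard library knows about.
        return f"{code}: unknown status code"
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```

With a live response, the numeric code is available as `response.status_code`, so `describe_status(response.status_code)` would describe the request made above.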

- HTML is often more difficult to navigate and query than XML (it is "dirtier" and more complex)
    - Example: [The Guardian](https://www.theguardian.com/uk)
- HTML is not always syntactically well-formed (parsers need to be more forgiving)

Inspecting the HTML underlying ["Alice in Wonderland"](https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/19033-h.htm).

In [None]:
response.content

In [None]:
type(response.content)
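
`response.content` holds raw bytes, while `response.text` is the decoded string version. The sketch below imitates that difference with a made-up byte string and assumes UTF-8 encoding; a real page may declare a different one.

```python
# Hypothetical page content, encoded the way a server might send it.
raw = "<p>café</p>".encode("utf-8")

print(type(raw))            # bytes, like response.content
text = raw.decode("utf-8")  # str, comparable to response.text
print(type(text))

# Non-ASCII characters take more than one byte, so the lengths differ.
print(len(raw), len(text))
```

The parsing cells below pass the bytes on unchanged, wrapped in `BytesIO`.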

## Parsing HTML

In [None]:
from lxml import etree
from io import StringIO, BytesIO

In [None]:
# parsing HTML as an XML document often raises an error
tree = etree.parse(BytesIO(response.content))

In [None]:
# we can define an HTML parser
parser = etree.HTMLParser()

In [None]:
html_tree = etree.parse(BytesIO(response.content), parser)

In [None]:
html_tree

In [None]:
html_root = html_tree.getroot()

In [None]:
len(html_root)

In [None]:
for el in html_root: print(el.tag)

In [None]:
# el still refers to the last child from the loop above; print the text of its first child
print(etree.tostring(el[0],method='text',encoding='unicode'))
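
The difference between the two parsers is easier to see on a small input. This sketch uses a made-up page (an assumption, not the Alice document) in which `<br>` and both `<p>` elements are left open; `demo_tree` and `demo_root` are illustrative names.

```python
from io import BytesIO
from lxml import etree

# Hypothetical page: the <br> and both <p> elements are never closed.
page = (b"<html><head><title>Demo</title></head>"
        b"<body><h2>Chapter I</h2><p>one<br>two<p>three</body></html>")

# The default XML parser rejects the unclosed tags.
try:
    etree.parse(BytesIO(page))
    xml_error = None
except etree.XMLSyntaxError as err:
    xml_error = err

# The HTML parser repairs them and returns a usable tree.
demo_tree = etree.parse(BytesIO(page), etree.HTMLParser())
demo_root = demo_tree.getroot()

print("XML parser failed:", xml_error is not None)
print("Root tag:", demo_root.tag)
print("Children:", [child.tag for child in demo_root])
print("Paragraphs recovered:", len(demo_root.xpath(".//p")))
```

The repaired tree has the usual `html` root with `head` and `body` children, which is why indexing such as `html_root[1]` reaches the body on a page that has both.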

## Navigating HTML tags

[Popular HTML tags](http://www.columbia.edu/~sss31/html/html-tags.html).

In [None]:
headings = html_root[1].xpath('.//h2'); len(headings)

In [None]:
for heading in headings: print(etree.tostring(heading,method='text',encoding='unicode'))
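
Headings sometimes contain inline markup, in which case `.text` alone returns only the text before the first nested tag. A small sketch on a made-up fragment (assumed for illustration) that collects the full heading text with `itertext()`:

```python
from lxml import etree

# Hypothetical fragment; the second heading contains inline markup.
fragment = "<body><h2>CHAPTER I</h2><p>text</p><h2>CHAPTER <i>II</i></h2></body>"
demo_root = etree.fromstring(fragment, etree.HTMLParser())

demo_headings = demo_root.xpath(".//h2")

# itertext() walks nested tags too, so text inside <i> is not lost.
titles = ["".join(h.itertext()).strip() for h in demo_headings]
print(titles)

# .text stops at the first child element.
print(repr(demo_headings[1].text))
```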

### Mining links

In [None]:
links = html_root[1].xpath('.//a'); len(links)

In [None]:
# links to images
links[1].attrib

In [None]:
# links to other parts of the document
links[10].attrib

In [None]:
# anchor text
links[10].text

In [None]:
hrefs = html_root[1].xpath('.//a/@href'); len(hrefs)

In [None]:
hrefs
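
The `href` values mix several kinds of targets. One possible way to sort them is sketched below with `urllib.parse`; the sample values are assumptions standing in for the real list, and `classify` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Hypothetical href values of the kinds a page like this may contain.
sample_hrefs = [
    "images/cover.jpg",
    "#CHAPTER_I",
    "https://www.gutenberg.org/",
    "images/i001.jpg",
]


def classify(href):
    """Label an href as an external link, an in-page anchor, or a relative path."""
    parts = urlparse(href)
    if parts.scheme in ("http", "https"):
        return "external"
    if not parts.path and parts.fragment:
        return "in-page"
    return "relative"


labels = [classify(href) for href in sample_hrefs]
for label, href in zip(labels, sample_hrefs):
    print(label, href)
```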

## Handling images

In [None]:
base_url = 'https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h'
base_url + '/' + hrefs[0]

In [None]:
img = requests.get(base_url + '/' + hrefs[0])

In [None]:
img.content

In [None]:
from IPython.display import Image
Image(img.content)

In [None]:
with open('../../cover.jpg','wb') as handler:
    handler.write(img.content)
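
Joining strings with `'/'` works here because the image paths are simple relative paths. As an alternative sketch, `urllib.parse.urljoin` resolves any kind of `href` against the page URL; the `href` values below are assumptions, not the actual contents of `hrefs`.

```python
from urllib.parse import urljoin

page_url = 'https://ia800908.us.archive.org/6/items/alicesadventures19033gut/19033-h/19033-h.htm'

# Hypothetical href values; the real ones come from the hrefs list.
sample_hrefs = ["images/cover.jpg", "#CHAPTER_I", "https://www.gutenberg.org/"]

# urljoin handles relative paths, in-page anchors and absolute URLs alike.
resolved = [urljoin(page_url, href) for href in sample_hrefs]
for url in resolved:
    print(url)
```

Note that the base is the full page URL, including the file name, rather than the directory used in `base_url`.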

## ✏️ Open Exercise: Comparing Online Newspapers

- Build a program that retrieves anchor text from the landing page of The Guardian (and of another paper, which you are free to choose)
- For each newspaper, collect all anchor text in one variable (or document)
- Compute word frequencies for each newspaper
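
As a starting point for the last step only, here is a rough sketch of counting words with `collections.Counter`. The anchor texts are invented, the tokenisation (lowercase letters and apostrophes) is a simplifying assumption, and `word_frequencies` is a hypothetical helper.

```python
import re
from collections import Counter

# Hypothetical anchor texts standing in for what a landing page might yield.
anchor_texts = ["Climate talks resume", "Talks on climate stall", None, "  Sport  "]


def word_frequencies(texts):
    """Count lowercase word tokens across a list of anchor texts."""
    counts = Counter()
    for text in texts:
        if text:  # .text is None for anchors that only wrap an image
            counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


freqs = word_frequencies(anchor_texts)
print(freqs.most_common(3))
```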

In [None]:
# enter your code here

# Fin.