# 6: XML

* XML basics
* Exploring XML with Beautiful Soup
* Web pages (HTML, etc.)
* Exercises

## XML basics

[XML](https://en.wikipedia.org/wiki/XML) (eXtensible Markup Language) is a standard way to add explicit structure or information to text. Spans of text are enclosed in _tags_ representing metatextual information, forming an _element_. Most tags come in pairs: a _start-tag_ and an _end-tag_.

<greeting> Hello world!</greeting>

Tags are enclosed in angle brackets, and end tags are distinguished from start tags with a forward slash after the first bracket.

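XML is a general framework that does not define specific valid tags. That said, there are a few rules about what can be a tag: a tag name must start with a letter or an underscore, can otherwise contain letters, digits, underscores, hyphens and periods, and cannot contain spaces.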

So, you can follow pretty much the same rules for valid Python variables and XML tags. XML tags are case sensitive, so make sure your opening and closing tags are exactly the same.

Importantly, XML elements can be nested in other XML elements.

<sent> this <copula>is</copula> <NP>a valid XML element</NP> </sent>

Since all XML documents form trees, there are things we can't do, like this:

**Here we are ending a tag before we end a nested tag.**

It's not possible to draw this as a tree (we can try, but...)
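A strict XML parser makes the point concrete. The sketch below uses Python's standard-library `xml.etree.ElementTree` (not otherwise used in these notes); the helper name `is_well_formed` and the `overlapping` string are made up for illustration. The first string nests its elements properly, while the second closes `copula` before the `NP` that was opened inside it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_string):
    """Return True if a strict XML parser accepts the string."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


# Properly nested: every element closes inside its parent
nested = "<sent> this <copula>is</copula> <NP>a valid XML element</NP> </sent>"

# Overlapping: copula is closed while NP (opened inside it) is still open
overlapping = "<sent> this <copula>is <NP>a</copula> broken XML element</NP> </sent>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```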


Note, though, that it is perfectly okay to nest multiple instances of the same tag. XML parsers will close each element in turn, assuming you know what you're doing (in terms of creating valid XML):

Though XML can be written without any line breaks or any formatting at all (it isn't Pythonic at all in that regard!!!), it might be easier if we look at it in another way (though note we've inserted some newlines here to accomplish this):

Start-tags optionally include attributes and their values.

The form of attribute names has the same restrictions as tags, and they should be consistent across elements with the same tags (but strictly speaking they don't _have_ to be). Attribute values are always strings (enclosed in quotes) even when they are numbers.
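To see that attribute values really are strings, here is a minimal sketch using the standard-library `xml.etree.ElementTree` parser. The sample element is invented for illustration, and `sentence_number` is just a name chosen here.

```python
import xml.etree.ElementTree as ET

# A start-tag with two attributes; the values must be quoted
elem = ET.fromstring('<sent type="declarative" n="1">Hello world!</sent>')

print(elem.attrib)           # {'type': 'declarative', 'n': '1'}
print(type(elem.get("n")))   # <class 'str'>

# Convert explicitly when you need an actual number
sentence_number = int(elem.get("n"))
print(sentence_number + 1)   # 2
```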

It is also possible to have an "empty" XML tag that contains no text at all, forming a non-text leaf of the XML document tree. For this, you put a forward slash at the end of the tag, just before the closing angle bracket (optionally preceded by a space).
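As a rough illustration, the sketch below parses three spellings of an empty element with the standard-library `xml.etree.ElementTree`. The tag names `line` and `lb` are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Three equivalent ways to write an empty element
variants = [
    "<line>first half<lb/>second half</line>",
    "<line>first half<lb />second half</line>",
    "<line>first half<lb></lb>second half</line>",
]

for xml_string in variants:
    lb = ET.fromstring(xml_string).find("lb")
    # No text and no children: a non-text leaf of the tree
    print(lb.text, len(lb))  # None 0
```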

<escapethiscode> prob += x + y &lt;= 10 </escapethiscode>
<escapethisheadline> Texas A&amp;M has won the game!</escapethisheadline>

You will sometimes see other characters expressed using this same kind of syntax. There are [tables](https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references) which will give you the escaped version of any character.

Note that most packages you would use (like Beautiful Soup, discussed below) will deal with this escaping automatically.
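If you ever need to escape text by hand, one option (an assumption here, since these notes do not otherwise use it) is the standard-library module `xml.sax.saxutils`. This small sketch escapes the two example strings from above and then reverses the process.

```python
from xml.sax.saxutils import escape, unescape

headline = "Texas A&M has won the game!"
code = "prob += x + y <= 10"

print(escape(headline))  # Texas A&amp;M has won the game!
print(escape(code))      # prob += x + y &lt;= 10

# unescape turns the entities back into the original characters
print(unescape(escape(code)) == code)  # True
```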

There are also comments in XML.

## Exploring XML with Beautiful Soup

You can use regular expressions to efficiently pull out spans of text associated with a specific tag. If you don't need to access the document structure, this is an efficient choice. 

In [1]:
example = '''
<text type="example"> 
    <sent type="declarative" n="1"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>
    <sent type="imperative" n="2"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
    <sent type="interrogative" n="3"> Is this <keyword>interrogative</keyword> sentence okay<punct>?</punct></sent>
</text>
'''

In [2]:
import re

regex = r"<keyword>([^<]+)</keyword>"

for match in re.finditer(regex,example):
    print(match.group(1))

declarative
imperative
interrogative
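Regular expressions can also pull out attribute values, but they are brittle. The following sketch (the pattern `attr_regex` and the `reordered` string are made up for illustration) works on the example above, yet silently finds nothing once the attributes appear in a different order, which is one reason to prefer a real parser when structure matters.

```python
import re

example = '''
<sent type="declarative" n="1"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>
<sent type="imperative" n="2"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
'''

# Capture the type and n attributes of each sent start-tag
attr_regex = r'<sent type="([^"]+)" n="(\d+)">'
print(re.findall(attr_regex, example))
# [('declarative', '1'), ('imperative', '2')]

# Same information, attributes in the other order: the pattern no longer matches
reordered = '<sent n="3" type="interrogative"> Is this okay? </sent>'
print(re.findall(attr_regex, reordered))  # []
```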


However, in other cases it is useful to take advantage of a parser that allows you to programmatically explore the XML document tree. There are several packages which will do this for you; one of the most popular is [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). You can just [create a "Soup" tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup) by passing in a string of XML. BeautifulSoup actually supports multiple parsing strategies, so you have to tell it that you want to use "lxml" (as the second argument) or it will complain with a warning and pick a parser for you.

In [3]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(example, "lxml")

In [4]:
soup

<html><body><text type="example">
<sent n="1" type="declarative"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>
<sent n="2" type="imperative"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
<sent n="3" type="interrogative"> Is this <keyword>interrogative</keyword> sentence okay<punct>?</punct></sent>
</text>
</body></html>

Note that BeautifulSoup has added two standard layers of tags (html, body) around the existing XML. This happens because "lxml" is an HTML parser. BeautifulSoup is primarily used to parse web pages, so this makes a certain amount of sense, though it is a bit presumptuous!
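The extra layers depend on the parser you choose. As a hedged comparison, the sketch below uses Python's built-in `"html.parser"` (a different parser from the "lxml" used in these notes), which leaves the markup unwrapped. The `snippet` string and the name `plain_soup` are made up for this example.

```python
from bs4 import BeautifulSoup

snippet = '<text type="example"><sent n="1">Hi<punct>!</punct></sent></text>'

# "html.parser" ships with Python and does not add html/body wrappers
plain_soup = BeautifulSoup(snippet, "html.parser")

print(plain_soup.find("html"))      # None
print(plain_soup.contents[0].name)  # text
```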

You can use the [prettify](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing) method if you want a bit more structure (make sure you use print!).

In [5]:
print(soup.prettify())

<html>
 <body>
  <text type="example">
   <sent n="1" type="declarative">
    This is a
    <keyword>
     declarative
    </keyword>
    sentence
    <punct>
     .
    </punct>
   </sent>
   <sent n="2" type="imperative">
    Read this
    <keyword>
     imperative
    </keyword>
    sentence
    <punct>
     !
    </punct>
   </sent>
   <sent n="3" type="interrogative">
    Is this
    <keyword>
     interrogative
    </keyword>
    sentence okay
    <punct>
     ?
    </punct>
   </sent>
  </text>
 </body>
</html>


We can find a specific node or nodes in the tree by using the [find/find_all](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree) methods. They accept strings, and also lists and even compiled regular expressions!

In [6]:
for node in soup.find_all(["sent", "punct"]):
    print(node)

<sent n="1" type="declarative"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>
<punct>.</punct>
<sent n="2" type="imperative"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
<punct>!</punct>
<sent n="3" type="interrogative"> Is this <keyword>interrogative</keyword> sentence okay<punct>?</punct></sent>
<punct>?</punct>


In [7]:
regex = re.compile(r".{3}word")

for node in soup.find_all(regex):
    print(node)

<keyword>declarative</keyword>
<keyword>imperative</keyword>
<keyword>interrogative</keyword>


In [8]:
for node in soup.find_all("punct"):
    print(node)

<punct>.</punct>
<punct>!</punct>
<punct>?</punct>


The nodes that we get by doing a find/find_all are not strings, but rather a special type, a `Tag`. It has four key attributes:

- `name`: the string identifier for the tag associated with the node
- `attrs`: a dictionary of the XML attributes, providing key/value pairs
- `contents`: a list of the node's children in the XML document tree
- `parent`: the node's parent in the XML document tree

Let's explore these attributes a bit, starting with the first sentence.

In [9]:
node = soup.find("sent")

In [10]:
node.name

'sent'

In [11]:
type(node)

bs4.element.Tag

In [12]:
node.contents

[' This is a ', <keyword>declarative</keyword>, ' sentence', <punct>.</punct>]

In [13]:
node.attrs

{'type': 'declarative', 'n': '1'}

In [14]:
node.parent.name

'text'

In [15]:
for child in node.parent.contents:
    print(child.name)

None
sent
None
sent
None
sent
None
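The `None` entries come from the whitespace (newlines) between the tags, which are text nodes and therefore have no tag name. If you only want the children that are real tags, one possible approach is sketched below; the `markup` string is a cut-down stand-in for the example, and `"html.parser"` is used only to keep the sketch free of extra dependencies.

```python
from bs4 import BeautifulSoup, Tag

markup = """<text>
<sent n="1">One.</sent>
<sent n="2">Two.</sent>
</text>"""
text_node = BeautifulSoup(markup, "html.parser").find("text")

# Keep only the children that are tags, skipping the whitespace strings
tag_children = [child for child in text_node.contents if isinstance(child, Tag)]
print([child.name for child in tag_children])  # ['sent', 'sent']

# find_all(True, recursive=False) returns the direct tag children as well
direct_tags = text_node.find_all(True, recursive=False)
print(len(direct_tags))  # 2
```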


A couple of things about (XML) attributes. First, you can treat the node itself as if it were a dictionary for the purpose of accessing its attributes.

In [16]:
node

<sent n="1" type="declarative"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>

In [17]:
node["n"]

'1'

Attributes can also be used as search criteria in find/find_all: just add the key as a (function) keyword! You can search for particular values with the same options as tags (strings, lists of strings, regexes).

In [18]:
soup

<html><body><text type="example">
<sent n="1" type="declarative"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>
<sent n="2" type="imperative"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
<sent n="3" type="interrogative"> Is this <keyword>interrogative</keyword> sentence okay<punct>?</punct></sent>
</text>
</body></html>

In [19]:
for x in soup.find_all(n="2"):
    print(x)

<sent n="2" type="imperative"> Read this <keyword>imperative</keyword> sentence<punct>!</punct></sent>
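Here is a small sketch of those options side by side. The `markup` string is a simplified version of the example, `demo_soup` is a name chosen here, and `"html.parser"` is used only so the sketch runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

markup = """<text type="example">
<sent type="declarative" n="1">One.</sent>
<sent type="imperative" n="2">Two!</sent>
<sent type="interrogative" n="3">Three?</sent>
</text>"""
demo_soup = BeautifulSoup(markup, "html.parser")

# A tag name and an attribute value together
imperatives = demo_soup.find_all("sent", type="imperative")

# A regex as the attribute value: types that start with "i"
i_types = demo_soup.find_all("sent", type=re.compile(r"^i"))

# A list of acceptable values, passed through the attrs dictionary
first_and_last = demo_soup.find_all(attrs={"n": ["1", "3"]})

print([node["n"] for node in imperatives])     # ['2']
print([node["n"] for node in i_types])         # ['2', '3']
print([node["n"] for node in first_and_last])  # ['1', '3']
```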


`contents` and `parent` allow you to manually explore the document tree around a node. For example, you can iterate over the list, or work your way up to the root node.

In [20]:
node

<sent n="1" type="declarative"> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>

In [21]:
node.parent.parent.name

'body'

In [22]:
for child in node.contents:
    print(child.name)

None
keyword
None
punct


Note that find/find_all can also be used on nodes other than the root, and will only search under that node:

(**Going deeper into the tree**)

In [23]:
for subnode in node.find_all("punct"):
    print(subnode)

<punct>.</punct>


Text nodes are not plain strings; instead, they are wrapped up in a special object (NavigableString), and they form the leaves of the tree.

In [24]:
type(node.contents[0])

bs4.element.NavigableString

In [25]:
node.contents[0]

' This is a '

In [26]:
node.contents[1]

<keyword>declarative</keyword>

In [27]:
node.contents[2]

' sentence'

In [28]:
text = str(node.contents[0])
text

' This is a '

In [29]:
type(text)

str

If you want to get all the text under a particular node in the tree as a string (with any XML markup removed), the best way is to use [get_text](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text).

**Note: This method discards the tags themselves but keeps the text inside every nested tag.**

In [30]:
node.get_text()

' This is a declarative sentence.'

In [31]:
type(node.get_text())

str
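`get_text` also takes an optional separator and a `strip` flag, which can help when the pieces of text run together. The sketch below rebuilds the first sentence on its own (using `"html.parser"` purely to avoid the lxml dependency) and joins the pieces with a visible `|` character.

```python
from bs4 import BeautifulSoup

markup = "<sent> This is a <keyword>declarative</keyword> sentence<punct>.</punct></sent>"
sent = BeautifulSoup(markup, "html.parser").find("sent")

# Default behaviour: text pieces are concatenated as they are
print(repr(sent.get_text()))           # ' This is a declarative sentence.'

# Strip each piece and join the pieces with a separator
print(sent.get_text("|", strip=True))  # This is a|declarative|sentence|.

# stripped_strings yields the same pieces one at a time
print(list(sent.stripped_strings))     # ['This is a', 'declarative', 'sentence', '.']
```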

## Web pages (HTML, etc.)

Web pages are written using HTML, which you can treat as a version of XML that is specific to the purpose of formatting documents for viewing on web browsers (it's a bit more complicated than that, but...). All the basics of XML syntax apply to HTML (though note HTML has a fixed set of tags). And that means we can and should process them using Beautiful Soup!

(Fun fact: Microsoft Word .docx files are also XML documents under the hood, though they are complex and you're better off using a dedicated package like [python-docx](https://python-docx.readthedocs.io/en/latest/user/documents.html) rather than Beautiful Soup to read them.)

First, though, web pages have to be opened using a URL (the same one you see in your browser). To do so, there's a special `urlopen` function (from the module `urllib.request`); just pass it the URL. This creates a filepointer object (more about these next week).

In [32]:
from urllib.request import urlopen

url = "http://www.ubc.ca"

f = urlopen(url)

If you're using BeautifulSoup, you can just pass this filepointer directly to the BeautifulSoup constructor. So easy!

Note you can also do this with a filepointer for an XML/HTML file on disk that you have opened with regular `open`.
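A minimal sketch of the on-disk case follows. It writes a throwaway file first so that it is self-contained; the file name `example.html` and its contents are made up, and `"html.parser"` stands in for "lxml" only to avoid an extra dependency.

```python
import os
import tempfile
from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.html")

    # Create a small HTML file to read back
    with open(path, "w", encoding="utf-8") as out:
        out.write("<html><body><p>Hello from disk!</p></body></html>")

    # Pass the open filepointer straight to the constructor
    with open(path, encoding="utf-8") as f:
        file_soup = BeautifulSoup(f, "html.parser")

print(file_soup.find("p").get_text())  # Hello from disk!
```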

In [33]:
html_soup = BeautifulSoup(f,"lxml")
html_soup2 = BeautifulSoup(urlopen(url), "lxml")

Let's take a look at what we've got!

In [34]:
html_soup

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7 oldie" lang="en-ca"><![endif]--><!--[if (IE 7)&!(IEMobile)]><html class="ie7 oldie" lang="en-ca"><![endif]--><!--[if (IE 8)&!(IEMobile)]><html class="ie8 oldie" lang="en-ca"><![endif]--><!--[if (IE 9)&!(IEMobile)]><html class="ie9" lang="en-ca"><![endif]--><!--[[if (gt IE 9)|(gt IEMobile 7)]><!--><html lang="en-ca"><!--<![endif]-->
<head>
<meta charset="utf-8"/>
<title>The University of British Columbia</title>
<meta content="width=device-width" name="viewport"/>
<meta content="The University of British Columbia is a global centre for research and teaching, consistently ranked among the top 20 public universities in the world." name="description"/>
<meta content="16761458703" property="fb:pages"/>
<!-- Stylesheets -->
<link href="//cdn.ubc.ca/clf/7.0.5/css/ubc-clf-full-bw.min.css" rel="stylesheet"/>
<link href="//cloud.typography.com/6804272/781004/css/fonts.css" rel="stylesheet" type="text/css"/>
<link href="/_assets/css/style.min.

Some HTML tags that you should know:

* `<html>` The entire page is nested in this tag
* `<head>` Metadata associated with the page and other miscellaneous information is stored here. No text!
* `<body>` This is where everything which is displayed can be found
* `<div>` This creates sections of the document (like headers, sidebars, etc.). Very important!
* `<form>` Used for user input, e.g. a search bar
* `<a>` An anchor, this creates links to other pages, link is in `href` attribute
* `<li>` A list item, inside a bulleted (`<ul>`) or numbered (`<ol>`) list
* `<h1>` header, number indicates size (also `<h2>`, etc.)
* `<img>` An image, `src` gives its location
* `<table>` A table, to display data.
* `<p>` a paragraph, has formatting options (e.g. indentation, spacing)
* `<span>` a span of text, smaller than a paragraph, formatting options like colour highlighting
* `<script>` used to load javascript (the programming language run by browsers)

Most HTML tags on modern web pages will have a `class` attribute. This refers to one or more CSS classes, which are used to format the object on the page. 

If you weren't aware, Jupyter notebooks support HTML in their Markdown cells, so it's easy to play around with!<p> Here's a paragraph with a <a href="http://www.ubc.ca">UBC link</a></p> 
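To connect these tags back to Beautiful Soup, here is a hedged sketch that collects link targets and searches by CSS class. The `page` string, its class names and the `/about` link are all invented for illustration. Note that the keyword is spelled `class_` because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

page = """<html><body>
<div class="sidebar nav"><a href="http://www.ubc.ca">UBC link</a></div>
<div class="main"><p>Here's a paragraph with <a href="/about">another link</a></p></div>
</body></html>"""
page_soup = BeautifulSoup(page, "html.parser")

# The destination of each link lives in its href attribute
links = [a["href"] for a in page_soup.find_all("a")]
print(links)  # ['http://www.ubc.ca', '/about']

# Search by CSS class; one matching class is enough
sidebar = page_soup.find("div", class_="sidebar")
print(sidebar["class"])  # ['sidebar', 'nav']
```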

## Exercises

In [35]:
TEI = '''
<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
  <fileDesc>
   <titleStmt>
    <title>Emma</title>
    <author>Jane Austen</author>
   </titleStmt>    
  </fileDesc>
 </teiHeader>
 <text>
   <p n="21">Mr. Knightley , a sensible man about seven or eight-and-thirty , was not only a very old and intimate friend of the family , but particularly connected with it , as the elder brother of Isabella 's husband . He lived about a mile from Highbury , was a frequent visitor , and always welcome , and at this time more welcome than usual , as coming directly from their mutual connexions in London . He had returned to a late dinner , after some days ’ absence , and now walked up to Hartfield to say that all were well in Brunswick Square . It was a happy circumstance , and animated Mr. Woodhouse for some time . Mr. Knightley had a cheerful manner , which always did him good ; and his many inquiries after “ poor Isabella ” and her children were answered most satisfactorily . When this was over , Mr. Woodhouse gratefully observed ,<said who="Woodhouse">“ It is very kind of you , Mr. Knightley , to come out at this late hour to call upon us . I am afraid you must have had a shocking walk . ”</said></p>
   <p n="22"><said who="Knightley">“ Not at all , sir . It is a beautiful moonlight night ; and so mild that I must draw back from your great fire . ”</said></p>
   <p n="23"><said who="Woodhouse">“ But you must have found it very damp and dirty . I wish you may not catch cold . ”</said></p>
   <p n="24"><said who="Knightley">“ Dirty , sir ! Look at my shoes . Not a speck on them . ”</said></p>
 </text>
</TEI>
'''
soup = BeautifulSoup(TEI,"lxml")

1. Above there is an XML snippet of Jane Austen similar (but not identical; note, for instance, the lack of divs) to what you're tackling in the lab. First, programmatically extract the name of the author from this snippet.

In [36]:
soup.find("author").get_text()

'Jane Austen'

2. Check to see if paragraph no. 22 in the text contains a "said" tag.

In [37]:
para = soup.find(n="22")

for node in para.contents:
    if node.name == "said":
        print("True")

True


Alternatively, we could write a function:

In [38]:
def contains_said(paragraph):
    # Check every child before concluding that there is no "said" tag
    for child in paragraph.contents:
        if child.name == "said":
            return True
    return False

contains_said(soup.find(n="22"))

True
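A shorter alternative, sketched under the assumption that only direct children should count, is to let `find` do the looping with `recursive=False`. The helper name `has_direct_child` and the two-paragraph `demo` snippet are made up for illustration, and `"html.parser"` is used only to keep the sketch dependency-free.

```python
from bs4 import BeautifulSoup


def has_direct_child(paragraph, tag_name):
    """Return True if paragraph has a direct child tag called tag_name."""
    return paragraph.find(tag_name, recursive=False) is not None


demo = BeautifulSoup(
    '<text><p n="1">Narration first , <said who="A">then speech .</said></p>'
    '<p n="2">Narration only .</p></text>',
    "html.parser",
)

print(has_direct_child(demo.find(n="1"), "said"))  # True
print(has_direct_child(demo.find(n="2"), "said"))  # False
```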

3. Each `said` tag contains a `who` attribute identifying who said the line. Create a Python dictionary (ideally a `defaultdict`) where the speakers are keys, and the values are lists of their lines of dialogue.

In [39]:
from collections import defaultdict

speakers = defaultdict(list)

for node in soup.find_all("said"):
    speaker = node["who"]
    speakers[speaker].append(node.get_text())
    
print(speakers)

defaultdict(<class 'list'>, {'Woodhouse': ['“ It is very kind of you , Mr. Knightley , to come out at this late hour to call upon us . I am afraid you must have had a shocking walk . ”', '“ But you must have found it very damp and dirty . I wish you may not catch cold . ”'], 'Knightley': ['“ Not at all , sir . It is a beautiful moonlight night ; and so mild that I must draw back from your great fire . ”', '“ Dirty , sir ! Look at my shoes . Not a speck on them . ”']})


4. From the MDS-CL page (https://masterdatascience.ubc.ca/programs/computational-linguistics), get a list of contentful (not just whitespace) strings which correspond to the text that appears in `p` HTML tags whose parent is a div that has a child in header style 2 whose `id` is "cl-curriculum"

In [40]:
soup = BeautifulSoup(urlopen("https://masterdatascience.ubc.ca/programs/computational-linguistics"), "lxml")
curriculum = []
for node in soup.find_all("p"):
    heading = node.parent.find("h2")
    # .get avoids a KeyError when the h2 has no id attribute
    if node.parent.name == "div" and heading is not None and heading.get("id") == "cl-curriculum":
        text = node.get_text()
        # Skip paragraphs that contain only whitespace
        if text.strip() != "":
            curriculum.append(text)

print(curriculum)

['The program structure includes 24 one-credit courses offered in four-week segments. Courses are lab-oriented and delivered in-person with some blended online content.', 'At the end of the six segments, an eight-week, six-credit capstone project is also included, allowing students to apply their newly acquired knowledge, while working alongside other students with real-life data sets. Please note that instructors are subject to change.', '* subject to change at the discretion of the MDS Computational Linguistics program']


In [41]:
text = '       The program structure includes 24 one-credit courses offered in four-week segments      '
text.strip()

'The program structure includes 24 one-credit courses offered in four-week segments'

The `strip()` method removes leading characters (at the beginning of the string) and trailing characters (at the end). By default it removes all whitespace, including spaces, tabs and newlines.
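A few variations, sketched with made-up strings: `lstrip` and `rstrip` work on one side only, and passing an argument to `strip` changes which characters are removed.

```python
padded = "\n\t  The program structure  \n"

print(repr(padded.strip()))   # 'The program structure'
print(repr(padded.lstrip()))  # 'The program structure  \n'
print(repr(padded.rstrip()))  # '\n\t  The program structure'

# With an argument, strip removes those characters instead of whitespace
starred = "***note***"
print(starred.strip("*"))     # note
```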