# Crawling the Web and Manual Annotation

In this exercise session, we are going to look at how to extract our data directly from the web. The Internet is a gigantic resource, and one that can usually be leveraged quite easily. In contrast to PDFs, the text is already encoded, and if we are lucky, the information is semantically structured inside the HTML documents that make up websites. (TLDR: **HTML > PDF**)

## Some Preliminaries for working with the Web

When you type a URL into your browser's address bar, it feels as if you go to the website you're looking for. What actually happens is that the website comes to you. By typing in the URL and pressing Enter, you make a request to the server where the website is hosted. If everything works out, the server answers by sending you the data that makes up the website. Your browser then interprets this data and displays what we all know and love as "websites". 
You can inspect this data by pressing F12 in your browser (this works in most browsers). [More Information](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/How_the_Web_works) (TLDR: **The sites come to you**)
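To make the request idea a bit more concrete, here is a minimal sketch that takes a URL apart and prints roughly what gets sent to the server. It uses only the standard library and does not touch the network. The URL is the issue page used later in this session; the printed request is a simplified illustration, as real browsers send many more header lines.

```python
from urllib.parse import urlsplit

url = "https://www.horizonte-magazin.ch/kategorie/ausgabe-124/"
parts = urlsplit(url)

# The host tells us which server to contact, the path which resource to ask for.
host = parts.netloc
path = parts.path or "/"

# A bare-bones HTTP/1.1 GET request (simplified; real browsers add more headers).
request_text = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

print("Server:", host)
print("Resource:", path)
print(request_text)
```

The `requests` library used in this session builds and sends such a request for us, so we never have to write it by hand.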

If you click a link on this website, the whole process starts again: there's a request to the server, and if it works out, the server sends you the requested data. In some cases, the response is a whole new page that gets loaded in your browser. In other cases, only part of the page is reloaded ([Ajax](https://en.wikipedia.org/wiki/Ajax_(programming))). In the latter case, we need more specialized tools to extract the data.

### Crawling and Scraping

These two ways in which a website can deliver its data give rise to two different types of data extraction from the web, **crawling** and **scraping**. 

- **Crawling**: Crawling means that we send a single GET request per page and save the HTML document we get in response. This is the only interaction we have with the server. Afterwards, we parse the HTML document on our local machine to get the information we want. The following example uses only this approach, and needs nothing more.

- **Scraping**: When we do web scraping, we interact with the website and server much as a human user would. This is sometimes necessary when we are looking for data that only gets loaded into the page after we've scrolled down a bit or clicked a button. A good example of this behaviour are comments on news articles (e.g. [Blick](https://www.blick.ch/news/ausland/coronavirus-trump-greift-reporter-bei-pressekonferenz-an-id15833432.html)). Here, we need tools that automate a complete browser: tools that can click, scroll and fill in forms just like humans do. In Python, [selenium](https://selenium-python.readthedocs.io/) does this by driving a real browser. [scrapy](https://scrapy.org/) is a popular framework for larger crawling projects, but on its own it does not run a browser.

This differentiation into "scraping" and "crawling" isn't used everywhere, and sometimes the terms are used interchangeably.

In some cases, the website or service we are trying to get data from provides an API, an *Application Programming Interface*. APIs make it even easier to get data, but they are becoming rarer. A great explanatory video on APIs and the problems with them can be found [here](https://www.youtube.com/watch?v=BxV14h0kFs0). (TLDR: **APIs**)

Especially when scraping, we need to be cautious about legal boundaries. Just because there are ways to access data doesn't mean we're allowed to. Many bigger websites explicitly forbid "automated interaction" in their terms of use. Usually these clauses are there to prevent big data enterprises such as Cambridge Analytica from exploiting the content, but you still need to be aware of this aspect of data extraction. (TLDR: **Legality**)

## The Goal

For this session, we are going to extract text from the website of _Horizons_. Horizons is a quarterly magazine from the [Swiss National Science Foundation](http://www.snf.ch/en/researchinFocus/research-magazine-horizons/Pages/default.aspx). It is published in three languages and available online, which makes it a good basis for building a multiparallel corpus in the science domain. (TLDR: **Horizons**)

The code I am going to use was partly developed by Tannon Kew, a Master's student here in Zurich.

In this session, we are not going to look at the alignment of the different languages. This will be done in the last session. For now, we'll first focus on extracting the data, and in the second half, we are going to look at the [brat](https://brat.nlplab.org/) infrastructure for manual annotation. (TLDR: **brat**)

## Data Extraction

There are five steps towards our final goal, plus a well-deserved sixth:

1. HTML-Crawling
2. HTML to XML (in our desired format)
3. XML to TXT (for use in brat)
4. Annotation with Brat (web/local)
5. Integrate the Annotation back to the XML-Document
6. Celebrate! 

### Step 1: HTML-Crawling

First, let's manually find an issue we are interested in.

<img src="img/issue_de.png" alt=https://www.horizonte-magazin.ch/kategorie/ausgabe-124/ width=100% height=100% />

[This](https://www.horizonte-magazin.ch/kategorie/ausgabe-124/) looks good! As we can see, this page doesn't contain the full articles yet, only tiles that link to them. I now want a list of all the **URL**s that lead to these articles. Once I have this list, I can iterate over it and download the articles one by one. (TLDR: **Find an Index Page**)

In [1]:
import requests

issue_url = "https://www.horizonte-magazin.ch/kategorie/ausgabe-124/"

response = requests.get(issue_url)
content = response.content

print(content)

b'\n<!DOCTYPE HTML>\n<html lang="de-DE">\n<head>\n    <meta charset="UTF-8"/>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n\t<meta name="viewport" content="width=device-width, initial-scale=1"/>\n    <link href="https://fonts.googleapis.com/css?family=Lato" rel="stylesheet">\n    <link rel="stylesheet" type="text/css" href="/fonts/ff-meta/MyFontsWebfontsKit.css">\n    <link rel="stylesheet" href="https://wf.typotheque.com/WF-030340-010104" type="text/css" />\n    <meta name="twitter:card" content="summary_large_image" />\n    <meta name="twitter:site" content="@horizonte_de" />\n    <meta name="twitter:creator" content="@horizonte_de" />\n    <meta name="twitter:title" content="Das Fundament bleibt ungewiss">\n    <meta name="twitter:description" content="<p>Klar baut die Theologie auf Glauben. Aber selbst in Physik und Mathematik ist das Wissen nicht rein. Wie kritisch hinterfragen Forschende ihre eigenen Grundlagen? Eine Entdeckungsreise durch die Disziplinen.</p>\n" />

That's how our browser sees the website before it makes it look nice. Somewhere in there are the URLs we are interested in. In the next snippet, I am going to instruct the computer on where to look for them. To find out where the computer needs to look, we use the inspect tool in our browser, usually accessible via F12. If you're using Safari, it's a bit different. In the inspect tool, we can hover over different elements in the HTML code and see the corresponding element highlighted in the rendered page. By doing that, we can navigate to our desired elements. We are specifically looking for `a` tags with `href` attributes, as these contain URLs. ([Reference](https://www.w3schools.com/tags/tag_a.asp))

<img src="img/issue_de_inspect.png" alt=https://www.horizonte-magazin.ch/kategorie/ausgabe-124/ width=100% height=100% />

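Now we have located the desired element. Every article tile has its own `<article>` tag, and in there, there's an `<a>` tag containing the link (wrapped in an `<h4>` with the class `post-title`, which is what the XPath expression below targets).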

To navigate the HTML document (which at this point is just a Python `bytes` object), I will be using the [html](https://lxml.de/lxmlhtml.html) package of [lxml](https://lxml.de/). Furthermore, I am using an XML element location language called [XPath](https://www.w3schools.com/xml/xpath_intro.asp). It is actually quite straightforward, so don't hesitate to have a look at the linked tutorial. (TLDR: **Find relevant HTML-Elements**)
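To get a feeling for how an XPath expression maps onto the document structure, here is a small sketch on a hand-written toy page (not the real Horizonte markup; the `example.org` URLs are placeholders). It uses the standard library's `xml.etree.ElementTree`, which only understands a subset of XPath and requires well-formed XML. lxml, which we use in this session, copes with real-world HTML and supports full XPath 1.0, including selecting attributes directly with `/@href`.

```python
import xml.etree.ElementTree as ET

# A hand-written, simplified stand-in for an index page (not the real markup).
toy_page = """
<html>
  <body>
    <article>
      <h4 class="post-title"><a href="https://example.org/article-1/">First</a></h4>
    </article>
    <article>
      <h4 class="post-title"><a href="https://example.org/article-2/">Second</a></h4>
      <h4 class="other"><a href="https://example.org/ignore-me/">Not a title</a></h4>
    </article>
  </body>
</html>
"""

root = ET.fromstring(toy_page)

# ".//h4"                  any h4 element anywhere below the root
# "[@class='post-title']"  ... but only if its class attribute has this value
# "/a"                     ... and from each of those, the direct a children
links = [a.get("href") for a in root.findall(".//h4[@class='post-title']/a")]
print(links)
```

Only the two links inside `post-title` headings are returned; the third link is filtered out by the predicate in square brackets.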


In [2]:
from lxml import html

root = html.fromstring(content)

article_links = root.xpath("//h4[@class='post-title']/a/@href")

print(article_links)


['https://www.horizonte-magazin.ch/2020/03/05/am-anfang-und-am-ende-bleibt-vieles-ungewiss/', 'https://www.horizonte-magazin.ch/2020/03/05/wo-grundlagenforschung-ein-luxus-ist/', 'https://www.horizonte-magazin.ch/2020/03/05/jede-neue-generation-von-forschenden-muss-hinterfragen-was-sie-glaubt/', 'https://www.horizonte-magazin.ch/2020/03/05/polarstern-ahoi/', 'https://www.horizonte-magazin.ch/2020/03/05/wie-die-prionen-ein-dogma-stuerzten/', 'https://www.horizonte-magazin.ch/2020/03/05/kommunikation-geht-durch-den-magen/', 'https://www.horizonte-magazin.ch/2020/03/05/nun-sag-wie-hast-du-es-mit-dem-glauben/', 'https://www.horizonte-magazin.ch/2020/03/05/kosmisches-netzwerk-kartiert/', 'https://www.horizonte-magazin.ch/2020/03/05/bilderstrecke-wie-ein-phoenix-aus-dem-eis/', 'https://www.horizonte-magazin.ch/2020/03/05/tierversuch-bewilligt/', 'https://www.horizonte-magazin.ch/2020/03/05/dem-die-folterer-vertrauen/', 'https://www.horizonte-magazin.ch/2020/03/05/elektronik-zum-biegen-und-da

In [3]:
my_article_url = article_links[0]

response = requests.get(my_article_url)
content = response.content

print(content)

b'<!DOCTYPE HTML>\n<html lang="de-DE">\n<head>\n    <meta charset="UTF-8"/>\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n\t<meta name="viewport" content="width=device-width, initial-scale=1"/>\n    <link href="https://fonts.googleapis.com/css?family=Lato" rel="stylesheet">\n    <link rel="stylesheet" type="text/css" href="/fonts/ff-meta/MyFontsWebfontsKit.css">\n    <link rel="stylesheet" href="https://wf.typotheque.com/WF-030340-010104" type="text/css" />\n    <meta name="twitter:card" content="summary_large_image" />\n    <meta name="twitter:site" content="@horizonte_de" />\n    <meta name="twitter:creator" content="@horizonte_de" />\n    <meta name="twitter:title" content="Das Fundament bleibt ungewiss">\n    <meta name="twitter:description" content="<p>Klar baut die Theologie auf Glauben. Aber selbst in Physik und Mathematik ist das Wissen nicht rein. Wie kritisch hinterfragen Forschende ihre eigenen Grundlagen? Eine Entdeckungsreise durch die Disziplinen.</p>\n" />\n

We might want to save the plain HTML alongside the extracted data, just in case we've missed anything, and for documentation's sake. (TLDR: **Save HTML**)
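One possible way of doing this is sketched below. The function name `save_html`, the directory name `html_raw` and the idea of naming the file after the last part of the URL are assumptions made for this illustration, not part of the original code.

```python
from pathlib import Path
from urllib.parse import urlsplit


def save_html(content: bytes, url: str, out_dir: str = "html_raw") -> Path:
    """Write the raw bytes of a downloaded page to out_dir and return the path."""
    # Use the last non-empty path segment of the URL as the file name.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    slug = segments[-1] if segments else "index"
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{slug}.html"
    # response.content is bytes, so we write in binary mode.
    target.write_bytes(content)
    return target
```

After cell `In [3]`, a call like `save_html(content, my_article_url)` would store the article under `html_raw/`. Note that two URLs ending in the same segment would overwrite each other, so a real crawler might need a more careful naming scheme.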

I mentioned before that the Horizons magazine is available in three languages. It would be a great benefit to the data we are collecting if we had it in all of those languages, as this would enable us to build a parallel corpus. Oftentimes, parallel data isn't as readily available as in this case. (TLDR: **Leverage what you can**)

If we take a closer look at the article page, we see that the header contains buttons leading to the English and the French version of the article. We could just grab the links hidden in these buttons and download their HTML as well. But there is an even better way to find these alternate versions of the page: the `hreflang` attribute ([Reference](https://en.wikipedia.org/wiki/Hreflang)). By convention, `<link rel="alternate" hreflang="...">` elements in the page's markup point to different language versions of the same resource. They aren't displayed to the user, but can be interpreted by browsers and automated applications. Because they are aimed at machines and follow a standard, they are safer to rely on for our purposes. (TLDR: **Be aware of Metaelements**)

In [4]:
html_root = html.fromstring(content)
english_article_url = html_root.xpath("//link[@rel='alternate' and @hreflang='en-US']/@href")
french_article_url = html_root.xpath("//link[@rel='alternate' and @hreflang='fr-FR']/@href")
print(english_article_url)
print(french_article_url)

['https://www.horizons-mag.ch/2020/03/05/the-sea-of-faith-in-an-ocean-of-science/']
['https://www.revue-horizons.ch/2020/03/05/le-fondement-demeure-toujours-incertain/']


For this session, we are not interested in building a multilingual corpus, so we won't use these links for now. Just be aware that the data you're looking for is sometimes linked together, and it can be rewarding to look out for these links and connections. If you discover a pattern in how resources are linked together, it can easily be scaled up and used in a lean and efficient crawler. (TLDR: **Look for patterns**)
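As an illustration of exploiting such a pattern, the following sketch collects all alternate language versions of a page into a dictionary in one go, instead of writing one query per language. To keep it runnable without the notebook's state, it uses the standard library's `html.parser` rather than lxml, and a hand-written sample that contains the two URLs found above. The class name `AlternateLinkParser` is made up for this example.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect <link rel="alternate" hreflang="..."> targets, keyed by language."""

    def __init__(self):
        super().__init__()
        self.alternates = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").split()
        if "alternate" in rel and attributes.get("hreflang") and attributes.get("href"):
            self.alternates[attributes["hreflang"]] = attributes["href"]


# Hand-written sample; the two alternate URLs are the ones found above.
sample_head = """
<head>
  <link rel="stylesheet" href="/style.css">
  <link rel="alternate" hreflang="en-US" href="https://www.horizons-mag.ch/2020/03/05/the-sea-of-faith-in-an-ocean-of-science/" />
  <link rel="alternate" hreflang="fr-FR" href="https://www.revue-horizons.ch/2020/03/05/le-fondement-demeure-toujours-incertain/" />
</head>
"""

parser = AlternateLinkParser()
parser.feed(sample_head)
for language, url in parser.alternates.items():
    print(language, url)
```

With lxml, the same idea would be a loop over the result of an XPath query for all `link` elements with `rel='alternate'`, reading `hreflang` and `href` from each element.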

Also be aware that your code usually shouldn't just work for one single HTML document. In this example, we might want to crawl all issues, and then all articles in all languages of these issues. If just one of these articles doesn't follow the pattern we used to write our crawler, it might throw an error and halt everything. This is undesirable, because it is *very likely* that we will encounter some messy data. It's not as bad as with the PDFs, but these websites are still built and maintained by stressed-out and distracted humans. Robust code can be achieved with [exception handling](https://realpython.com/python-exceptions/#the-try-and-except-block-handling-exceptions) and comprehensive logging. (TLDR: **Write robust code**)
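A minimal sketch of what this could look like is shown below. The names `process_all` and `fake_process` are invented for this illustration, and `fake_process` merely stands in for the real download and extraction steps so that the example runs without network access.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("crawler")


def process_all(urls, process_one):
    """Apply process_one to every URL; log failures instead of stopping."""
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = process_one(url)
        except Exception:
            # logger.exception also records the traceback for later inspection.
            logger.exception("Could not process %s", url)
            failed.append(url)
    logger.info("%d succeeded, %d failed", len(results), len(failed))
    return results, failed


def fake_process(url):
    """Stand-in for download + extraction; fails on one URL on purpose."""
    if "broken" in url:
        raise IndexError("expected element not found")
    return url.upper()


results, failed = process_all(
    ["https://example.org/ok-1/", "https://example.org/broken/", "https://example.org/ok-2/"],
    fake_process,
)
print(failed)
```

Catching the broad `Exception` keeps the sketch short. In a real crawler, it is usually better to catch only the errors you expect, for instance `requests.RequestException` for network problems or `IndexError` when an XPath query returns an empty list, and to go through the list of failed URLs afterwards.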

Now, we will begin constructing the output XML-file. First, we identify some metainformation.

In [5]:
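import re

# Issue number: the digits after "ausgabe-" in the URL of the issue page.
issue = re.search(r"ausgabe-(\d+)", issue_url).group(1)
# Language: the first two letters of the lang attribute of <html>, e.g. "de-DE" -> "de".
lang = html_root.attrib["lang"][:2]
# Author and title: spaces are replaced by hyphens so they can be used in a file name.
author = html_root.xpath("//a[@rel='author']/text()")[0].replace(' ', '-')
title = html_root.xpath("//h4[@class='post-title']/text()")[0].replace(' ', '-')
# Date: the article URL contains it as yyyy/mm/dd.
date = re.search(r"\d\d\d\d/\d\d/\d\d", my_article_url).group().replace("/", "-")

print(issue, lang, author, title, date)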

124 de Florian-Fisch Das-Fundament-bleibt-ungewiss 2020-03-05


Now, we start constructing the XML-Output, by using the etree module of lxml.

In [6]:
from lxml import etree

a_id = 0

article_element = etree.Element("article")

article_element.set("id", "a"+str(a_id))
article_element.set("issue", issue)
article_element.set("lang", lang)
article_element.set("date", date)

titlechild = etree.SubElement(article_element, "div", attrib={"class": "title"})
titlechild.text = title.replace("-", " ")

authorchild = etree.SubElement(article_element, "div", attrib={"class": "author"})
authorchild.text = author

abstract = html_root.xpath("//div[@class='post-lead']")[0].text_content()
abstractchild = etree.SubElement(article_element, "div", attrib={"class": "abstract"})
abstractchild.text = abstract

tree_preview = etree.ElementTree(article_element)
print(etree.tostring(tree_preview, pretty_print=True, xml_declaration=True, encoding="utf-8").decode("utf-8"))

<?xml version='1.0' encoding='utf-8'?>
<article id="a0" issue="124" lang="de" date="2020-03-05">
  <div class="title">Das Fundament bleibt ungewiss</div>
  <div class="author">Florian-Fisch</div>
  <div class="abstract">Klar baut die Theologie auf Glauben. Aber selbst in Physik und Mathematik ist das Wissen nicht rein. Wie kritisch hinterfragen Forschende ihre eigenen Grundlagen? Eine Entdeckungsreise durch die Disziplinen.
</div>
</article>



Now for the article itself. We first try out the simple `text_content()` method from lxml, which returns the text content of an element and all of its descendants. 

In [7]:
html_excerpt_element = html_root.xpath("//div[@class='post-excerpt']")[0]
excerpt = html_excerpt_element.text_content()
print(excerpt)


							Wer im Mittelalter durch den Rand des Himmels hätte kriechen und das geheimnisvolle Jenseits betrachten können, hätte einen Beleg für seinen Glauben gefunden. | Bild: Wikimedia
Wissenschaft bestätigt oder verwirft ihre Theorien und Thesen anhand von Beobachtungen, etwa aus Experimenten. Im Forschungsalltag und in der konkreten Anwendung von Forschung gelten die Regeln der Empirie und der Nachvollziehbarkeit. Doch was innerhalb der verschiedenen Disziplinen jeweils funktioniert, kann zur Glaubenssache werden. Von ausserhalb betrachtet, stehen manche Grundsätze, Theorien und Modelle auf wackligen Beinen. Horizonte nimmt eine Stichprobe in den Disziplinen, um zu verstehen, wo in der Wissenschaft der Glaube anfängt. Oder aufhört. Wo werden Dinge für wahr gehalten, ohne dass überprüfbare Gründe dafürsprechen?
PHYSIK: Stringtheorie soll vereinen
Die beiden Schweizer Nobelpreisträger Didier Queloz und Michel Mayor hatten 1995 den Exoplaneten 51 Pegasi b entdeckt und damit die damals g

This doesn't yield a clean enough result: the image caption and a lot of whitespace end up in the text. Luckily, it doesn't take much to refine the extraction ourselves. First, we inspect the article again and try to figure out which subelements the relevant text is stored in.

In [8]:
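# Select only paragraphs, h5 headings and the inner heading divs of the article body.
text_elements = html_excerpt_element.xpath("./p | ./h5 | .//div[@class='su-heading-inner']")

excerptchild = etree.SubElement(article_element, "div", attrib={"class": "excerpt"})

for element in text_elements:
    # element.text is None if the element starts with a child tag, so check it first.
    if element.text and len(element.text) > 1:
        par = etree.SubElement(excerptchild, "p")
        par.text = element.text_content()
        print(par.text)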

Wissenschaft bestätigt oder verwirft ihre Theorien und Thesen anhand von Beobachtungen, etwa aus Experimenten. Im Forschungsalltag und in der konkreten Anwendung von Forschung gelten die Regeln der Empirie und der Nachvollziehbarkeit. Doch was innerhalb der verschiedenen Disziplinen jeweils funktioniert, kann zur Glaubenssache werden. Von ausserhalb betrachtet, stehen manche Grundsätze, Theorien und Modelle auf wackligen Beinen. Horizonte nimmt eine Stichprobe in den Disziplinen, um zu verstehen, wo in der Wissenschaft der Glaube anfängt. Oder aufhört. Wo werden Dinge für wahr gehalten, ohne dass überprüfbare Gründe dafürsprechen?
PHYSIK: Stringtheorie soll vereinen
Die beiden Schweizer Nobelpreisträger Didier Queloz und Michel Mayor hatten 1995 den Exoplaneten 51 Pegasi b entdeckt und damit die damals gängige Theorie zur Planetenentstehung über den Haufen geworfen. Genau so sollen wissenschaftliche Theorien gemäss dem Philosophen Karl Popper falsifiziert werden können.
Doch nicht imme

In [9]:
tree = etree.ElementTree(article_element)
print(etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="utf-8").decode("utf-8"))

<?xml version='1.0' encoding='utf-8'?>
<article id="a0" issue="124" lang="de" date="2020-03-05">
  <div class="title">Das Fundament bleibt ungewiss</div>
  <div class="author">Florian-Fisch</div>
  <div class="abstract">Klar baut die Theologie auf Glauben. Aber selbst in Physik und Mathematik ist das Wissen nicht rein. Wie kritisch hinterfragen Forschende ihre eigenen Grundlagen? Eine Entdeckungsreise durch die Disziplinen.
</div>
  <div class="excerpt">
    <p>Wissenschaft bestätigt oder verwirft ihre Theorien und Thesen anhand von Beobachtungen, etwa aus Experimenten. Im Forschungsalltag und in der konkreten Anwendung von Forschung gelten die Regeln der Empirie und der Nachvollziehbarkeit. Doch was innerhalb der verschiedenen Disziplinen jeweils funktioniert, kann zur Glaubenssache werden. Von ausserhalb betrachtet, stehen manche Grundsätze, Theorien und Modelle auf wackligen Beinen. Horizonte nimmt eine Stichprobe in den Disziplinen, um zu verstehen, wo in der Wissenschaft der Glaub

This now only needs to be written to a file, and we can build a beautiful, nicely structured corpus!

In [10]:
with open(f'{issue}_{lang}_{title}_{date}.xml', 'wb') as outfile:
    tree.write(outfile, pretty_print=True, xml_declaration=True, encoding="utf-8")
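One thing to watch out for when scaling this up: the title ends up in the file name, and titles may contain characters such as `?` or `/` that are problematic in file names on some systems. The helper below is a hedged sketch of one way to deal with that. The name `safe_filename_part` and the choice to reduce everything to ASCII are assumptions made for this example, and the test strings are made up.

```python
import re
import unicodedata


def safe_filename_part(text: str, max_length: int = 80) -> str:
    """Reduce text to ASCII letters, digits, '-' and '_' for use in a file name."""
    # Decompose accented characters (e.g. ü -> u + combining mark), then drop non-ASCII.
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Replace every run of remaining unsafe characters by a single hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9_-]+", "-", ascii_text).strip("-")
    return cleaned[:max_length] or "untitled"


print(safe_filename_part("Nun sag, wie hast du es mit dem Glauben?"))
print(safe_filename_part("Für/wen?"))
```

The file name could then be built as `f'{issue}_{lang}_{safe_filename_part(title)}_{date}.xml'`. Be aware that this drops diacritics (`ü` becomes `u`), which is fine for file names but should never be applied to the corpus text itself.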

For the following steps, we are going to leave the notebook. First, we'll use the script `get_text_to_annotate.py` to bring the XML into a format we can import into our annotation tool, brat. Then we'll do the annotation in brat, export it, and put the annotations back into the XML with the script `reintegrate_annotation.py`. Both scripts can be used as command-line tools and provide some documentation if called with `-h`.