# Chapter 8: Webscraping

---
## Webscraping and parsing

The web is a giant repository of (textual) information and the need to extract information from it automatically is clear. One very useful application of scripts is for processing online data. 

Web scraping is a computer software technique of **extracting information from websites**. This involves 1) fetching the HTML document of a webpage and 2) parsing: the transformation of unstructured data (HTML format) on the web into structured data (database or spreadsheet).

Python has many excellent pre-made software suites (or *libraries*) and is thus ideal for this purpose. Today we will give a few examples using the `requests` library for fetching pages, and the `bs4` (BeautifulSoup) library for parsing HTML.

Let's take these libraries for a spin! Suppose we want to know what LT3 researchers like to write about, by collecting a corpus of paper abstracts from the [LT3 publications page](https://www.lt3.ugent.be/publications/). First, we want to crawl links to individual paper pages, where the abstracts can be found.

## Fetching a webpage

In [9]:
import requests
import bs4

r = requests.get('https://www.lt3.ugent.be/publications') # This will store a "Response" object in r
html = r.text # Response objects have an attribute "text" which contains the HTML of the page
print(html)


<!doctype html>
<html lang="en">
    <head>
        <link rel="alternate" title="LT³ - Language and Translation Technology Team" type="application/rss+xml" href="/rss/"/>
        <meta charset="UTF-8">
        <title>LT&sup3; publications</title>
        <meta name="viewport" content="width=device-width, initial-scale=1"/>
        
            
                <link rel="stylesheet" href="/static/css/ugent_huisstijl.css">
                <link rel="stylesheet" href="/static/css/overwrites.css">
            
        
        <link href="/static/images/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    </head>

    <body>
        <div class="fluid-container">
            <div class="row">
                <header class="pageheader col-xs-12 ">
                    <nav class="navbar navbar-default">
                        <div class="row">
                            <div class="navbar-header col-xs-12 col-sm-2">
                                <div class="page-logo">
          

In the above code block we first import the requests and beautifulsoup4 libraries, which came pre-installed in the Conda distribution.

We then fetch the webpage from the URL using the get() function. This returns a "Response" object which contains the HTML text and meta-data of the page. When we extract the HTML only, we can start to parse it. But first we have a short look at the HTML language.
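
One caveat about the fetching step: a request can fail (the page does not exist, the server is down, the connection hangs), and `r.text` would then contain an error page or nothing useful. Below is a minimal sketch of a more defensive fetch. The function name `fetch_html` and the 10-second timeout are our own choices for illustration, not part of the `requests` library.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of url, or raise an exception if the request failed."""
    r = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
    return r.text
```

You could then write `html = fetch_html('https://www.lt3.ugent.be/publications')` instead of the two separate lines above.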

## Introduction to HTML
HTML stands for HyperText Markup Language, and is the language used to describe (mark up) web pages. It is the underlying language that structures web-page content. HTML itself does not determine the way things look – it only helps to classify and structure content. It is your browser's job to receive and process the HTML so that the page can be displayed.


### HTML Elements
An HTML document is a structured representation of content which is subdivided into elements. Elements are the building blocks of a page. HTML tags label pieces of content such as "heading", "paragraph", "table", and so on. Browsers do not display the HTML tags, but use them to render the content of the page.

Elements are identified by their name, the 'tag'.
HTML tags usually come in pairs: an opening tag `<tagname>` and a closing tag with a forward slash before the tag name, `</tagname>`.
Tags can have an inner text and "attributes" (named properties):
```
<tag attribute="value">text</tag>

Common tags:
    <html> – the whole document
    <body> – the human-readable part of the web page
    <table> – the frame of a table element
    <tr> – a row in a table
    <td> – a cell of content inside a row
    <th> – a table header cell inside a row
```
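
To make the terms tag, attribute and inner text more concrete, here is a small sketch that uses Python's built-in `html.parser` module to list the pieces of a tiny HTML snippet. The class name `TagPrinter` and the snippet itself are made up for illustration; later in this chapter BeautifulSoup does this work for us.

```python
from html.parser import HTMLParser


class TagPrinter(HTMLParser):
    """Records every opening tag, piece of text and closing tag it sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "intro")]
        self.events.append(("open", tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("text", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


parser = TagPrinter()
parser.feed('<p class="intro">Hello <a href="/news/">news</a></p>')
for event in parser.events:
    print(event)
```

The printed events show that the `<a>` element is nested inside the `<p>` element: it opens and closes before the paragraph's closing tag arrives.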

A web page consists of a series of these elements, which can be nested inside each other. Here is an example of a very simple page:

```
 <!DOCTYPE html>
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>

        <h1>My First Heading</h1>
        <p>My first paragraph.</p>

    </body>
</html> 
```

You can view the HTML source of any webpage in your browser by right-clicking on the page and selecting "View source" from the menu.

### HTML links
HTML links are elements defined with the `<a>` tag. The URL itself is the value assigned to the `href` attribute.
```
<a href="https://www.w3schools.com">This is a link</a>
```

### More resources
Here are some tutorials on HTML, if you want to learn more out of interest or because you will be scraping for your final project.

W3School introduction to HTML: https://www.w3schools.com/html/


## Parsing the page
You do not need to fully understand HTML to start webscraping as a lot of the heavy lifting in the exercise will be done by the BeautifulSoup4 library. BeautifulSoup will process the raw HTML text document so we can easily obtain elements and content from it.

In [1]:
import requests
import bs4

r = requests.get('https://www.lt3.ugent.be/publications') # This will store a "Response" object in r
html = r.text # Response objects have an attribute "text" which contains the HTML of the page
soup = bs4.BeautifulSoup(html, "lxml") # This returns a parsed HTML object
links = []
for link in soup.find_all("a"): # Try help(soup.find_all) to understand what this does.
                                # It takes a lot of work out of your hands!
    print(link["href"])
    links.append(link["href"]) # Let's store all the links found on this page for later

/
#
#
https://ugent.be
/admin/login/
/news/
/projects/
/publications/
/people/
/tools/
/courses/
/contact/
/
/publications/how-do-students-cope-with-machine-translation-outp/
/publications/dutch-compound-splitting-for-bilingual-terminology/
/publications/scate-taxonomy-and-corpus-of-machine-translation-e/
/publications/translationese-and-post-editese-how-comparable-is-/
/publications/translation-methods-and-experience-a-comparative-a/
/publications/an-automatic-part-of-speech-tagger-for-middle-low-/
/publications/cartoons-as-interdiscourse-a-quali-quantitative-an/
/publications/a-neural-network-architecture-for-detecting-gramma/
/publications/semeval-2016-task-13-taxonomy-extraction-evaluatio/
/publications/the-effectiveness-of-consulting-external-resources/
/publications/a-translation-robot-for-each-translator-a-comparat/
/publications/computers-en-plagiaatdetectie/
/publications/the-many-aspects-of-fine-grained-sentiment-analysi/
/publications/rude-waiter-but-mouthwatering-pastries-a

We obtain the URL links by finding all `<a>` tags and getting their href value.
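
Note that `link["href"]` assumes every `<a>` tag has an `href` attribute. That happens to hold on this page, but an `<a>` tag without one would raise a `KeyError`. The sketch below shows two ways of guarding against that on a made-up HTML snippet. It uses Python's built-in `"html.parser"` instead of `"lxml"` so that it runs without extra dependencies.

```python
import bs4

html = """
<a href="/news/">News</a>
<a name="top">An anchor without href</a>
<a href="/publications/">Publications</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed

# Option 1: only match <a> tags that have an href attribute at all
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Option 2: .get() returns None instead of raising a KeyError
maybe_hrefs = [a.get("href") for a in soup.find_all("a")]

print(hrefs)
print(maybe_hrefs)
```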

We now have a list of all the links that we found on the publications page of `https://www.lt3.ugent.be`. Not all of them are links to individual paper pages, and they are also relative links. Let's fix that!

For the first problem, we can use the fact that all individual pages have a link of the form `/publications/slug-of-publication-name/`. In other words, they all start with `/publications/` and end with a trailing slash. Note that the link `/publications/` also satisfies these conditions, so we need to make sure that there are exactly 3 slashes. Of course, you could come up with other rules for getting the right links, this is just one solution.

This is a perfect time for using list comprehensions! We have a list of links, and we want to improve it.

In [2]:
links = [l for l in links if l.startswith("/publications/")]
links = [l for l in links if l.endswith("/")]
links = [l for l in links if l.count("/") == 3]
print(links)

['/publications/how-do-students-cope-with-machine-translation-outp/', '/publications/dutch-compound-splitting-for-bilingual-terminology/', '/publications/scate-taxonomy-and-corpus-of-machine-translation-e/', '/publications/translationese-and-post-editese-how-comparable-is-/', '/publications/translation-methods-and-experience-a-comparative-a/', '/publications/an-automatic-part-of-speech-tagger-for-middle-low-/', '/publications/cartoons-as-interdiscourse-a-quali-quantitative-an/', '/publications/a-neural-network-architecture-for-detecting-gramma/', '/publications/semeval-2016-task-13-taxonomy-extraction-evaluatio/', '/publications/the-effectiveness-of-consulting-external-resources/', '/publications/a-translation-robot-for-each-translator-a-comparat/', '/publications/computers-en-plagiaatdetectie/', '/publications/the-many-aspects-of-fine-grained-sentiment-analysi/', '/publications/rude-waiter-but-mouthwatering-pastries-an-explorat/', '/publications/all-mixed-up-finding-the-optimal-featur
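
As mentioned, other rules are possible. One alternative, sketched here on a small made-up sample of links, is to express roughly the same rule as a single regular expression using the built-in `re` module. Treat it as one option among many rather than the preferred solution.

```python
import re

# Hypothetical sample of links, mimicking the ones found on the page
sample_links = ["/", "#", "/publications/", "/people/",
                "/publications/computers-en-plagiaatdetectie/"]

# "/publications/", then one or more non-slash characters, then a final "/"
pattern = re.compile(r"/publications/[^/]+/")
paper_links = [link for link in sample_links if pattern.fullmatch(link)]
print(paper_links)
```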

In [3]:
# Now, let's prefix all the links with the domain name, so we get absolute links instead of relative ones
links = ["https://www.lt3.ugent.be" + l for l in links] # Simple string concatenation!
print(links)

['https://www.lt3.ugent.be/publications/how-do-students-cope-with-machine-translation-outp/', 'https://www.lt3.ugent.be/publications/dutch-compound-splitting-for-bilingual-terminology/', 'https://www.lt3.ugent.be/publications/scate-taxonomy-and-corpus-of-machine-translation-e/', 'https://www.lt3.ugent.be/publications/translationese-and-post-editese-how-comparable-is-/', 'https://www.lt3.ugent.be/publications/translation-methods-and-experience-a-comparative-a/', 'https://www.lt3.ugent.be/publications/an-automatic-part-of-speech-tagger-for-middle-low-/', 'https://www.lt3.ugent.be/publications/cartoons-as-interdiscourse-a-quali-quantitative-an/', 'https://www.lt3.ugent.be/publications/a-neural-network-architecture-for-detecting-gramma/', 'https://www.lt3.ugent.be/publications/semeval-2016-task-13-taxonomy-extraction-evaluatio/', 'https://www.lt3.ugent.be/publications/the-effectiveness-of-consulting-external-resources/', 'https://www.lt3.ugent.be/publications/a-translation-robot-for-each-t
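
String concatenation works here because we know that all remaining links are relative and start with a slash. If you are less sure about the links you collected, the standard library function `urllib.parse.urljoin` is a more general option: it resolves a link against the page it was found on and leaves links that are already absolute untouched. A minimal sketch:

```python
from urllib.parse import urljoin

base = "https://www.lt3.ugent.be/publications/"  # the page the links came from

relative = urljoin(base, "/publications/computers-en-plagiaatdetectie/")
absolute = urljoin(base, "https://ugent.be")

print(relative)
print(absolute)
```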

Now that we have a list of absolute links it is time to get our abstracts! By looping over all these links, downloading each page and extracting what we want from it, we can fill a list with abstracts.

#### DIY 1
Open https://www.lt3.ugent.be/publications/how-do-students-cope-with-machine-translation-outp/ > `right-click` > `View Source`. Which element contains the abstract? What are its attributes and their values?

#### DIY 2
Your browser has an even faster way of inspecting the HTML elements we are interested in extracting: `Right-click directly on the content` > `Inspect (element)`. This will bring you directly to the tag containing the content in the source.

Open https://www.lt3.ugent.be/publications/how-do-students-cope-with-machine-translation-outp/, inspect the publication's title. Which element contains the title?

In [4]:
# let's get the abstracts from the elements
abstracts = []
for link in links:
    print(link)
    r = requests.get(link) # get the Response object for the link
    soup = bs4.BeautifulSoup(r.text, "lxml")
    abstract_div = soup.find("div", {"class": "full-text"}) # find the first div element with "full-text" as class attribute
    print(abstract_div)
    if abstract_div: # Sometimes this is None, when a publication page has no abstract
        # We are only interested in the text of the abstract, not the HTML code around it (div and p tags)
        # BeautifulSoup objects make it very easy to access child elements of an HTML object, using dot notation
        abstract = abstract_div.p.text # because every abstract_div contains a p containing text
        abstracts.append(abstract)
print("Finished crawling", len(abstracts), "abstracts!")

https://www.lt3.ugent.be/publications/how-do-students-cope-with-machine-translation-outp/
<div class="full-text">
<p>In this chapter, we take a closer look at students' post-editing of multi-word units (MWUs) from English into Dutch. The data consists of newspaper articles post-edited by translation students as collected by means of advanced keystroke logging tools. <br/>We discuss the quality of the machine translation (MT) output for various types of MWUs, and compare this with the final post-edited quality. In addition, we examine the external resources consulted for each type of MWU. Results indicate that contrastive MWUs are harder to translate for the MT system, and harder to correct by the student post-editors than non-contrastive MWUs. We further find that consulting a variety of external resources helps student post-editors solve MT problems.</p>
</div>
https://www.lt3.ugent.be/publications/dutch-compound-splitting-for-bilingual-terminology/
<div class="full-text">
<p>Compound

None
https://www.lt3.ugent.be/publications/rude-waiter-but-mouthwatering-pastries-an-explorat/
<div class="full-text">
<p>The fine-grained task of automatically detecting all sentiment expressions within a given document and the aspects to which they refer is known as aspect-based sentiment analysis. In this paper we present the first full aspect-based sentiment analysis pipeline for Dutch<br/>and apply it to customer reviews. To this purpose, we collected reviews from two different domains, i.e. restaurant and smartphone reviews. Both corpora have been manually annotated using newly developed guidelines that comply to standard practices in the field. For our experimental pipeline we perceive aspect-based sentiment analysis as a task consisting of three main subtasks which have to be tackled incrementally: aspect term extraction, aspect category classification and polarity classification. First experiments on our Dutch restaurant corpus reveal that this is indeed a feasible approach th

<div class="full-text">
<p>This study characterises and problematises the language of corporate reporting along region, industry, genre, and content lines by applying readability formulae and more advanced natural language processing (NLP)–based analysis to a manually assembled 2.75-million-word corpus. Readability formulae reveal that, despite its wider readership, sustainability reporting remains a very difficult to read genre, sometimes more difficult than financial reporting. Although we find little industry impact on readability, region does prove an important variable, with NLP-based variables more strongly affected than formulae. These results not only highlight the impact of legislative contexts but also language variety itself as an underexplored variable. Finally, the study reveals some of the weaknesses of default readability formulae, which are largely unable to register syntactic variation between the varieties of English in the reports and demonstrates the merits of NLP i

None
https://www.lt3.ugent.be/publications/the-efficacy-of-terminology-extraction-systems-f-2/
<div class="full-text">
<p>This article investigates whether the integration of a domain-specific, bilingual glossary supports audiovisual translators of documentaries in terms of translation process time and terminological errors. After a short review of issues typical of documentary translation and a discussion of the use of translation-memory software in general, the reference corpora are described. Next, a manually labeled glossary is created and its constitution is explained with special emphasis on the criteria used to qualify what a term is, or is not. This glossary is then used as a gold standard to calculate the rate of agreement with the glossary of three automatic terminology-extraction systems. Finally, experiments with Master's students demonstrate how both glossaries (the gold standard and one automatically extracted glossary) reduce their process time significantly but not the 
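
One detail in the output above deserves attention: some abstracts contain `<br/>` tags without surrounding spaces (for example `for Dutch<br/>and apply`). With `.text`, the words on either side of such a tag are glued together. The sketch below, which uses a shortened made-up abstract and the built-in `"html.parser"`, shows how `get_text()` with a separator avoids this.

```python
import bs4

html = '<div class="full-text"><p>a pipeline for Dutch<br/>and apply it</p></div>'
soup = bs4.BeautifulSoup(html, "html.parser")
abstract_div = soup.find("div", {"class": "full-text"})

glued = abstract_div.p.text                          # no separator between text pieces
spaced = abstract_div.p.get_text(" ", strip=True)    # pieces joined with a space

print(glued)
print(spaced)

# When no matching element exists, find() returns None, as seen in the output above
print(soup.find("div", {"class": "no-such-class"}))
```

Whether this matters depends on what you do with the text afterwards; for counting words, as we do next, glued words would be counted as one unknown word.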

Hooray, we collected a bunch of abstracts! Let's have some fun with them!

In [5]:
## Make a frequency dictionary
freq_dict = {}
for abstract in abstracts:
    words = abstract.lower().split()
    for word in words:
        freq_dict[word] = freq_dict.get(word, 0) + 1

## Find the most frequent words
freq_tuples = list((freq, word) for word, freq in freq_dict.items()) # We need to have the frequency first
freq_tuples.sort() # Because that is what we want to sort on
freq_tuples.reverse() # In descending order
print("The top 30 most frequently used words are:")
for freq, word in freq_tuples[:30]:
    print(freq, word)

# Get the tuples for which the word is longer than 5 characters (to get fewer function words)
freq_tuples = [tup for tup in freq_tuples if len(tup[1]) > 5]
print("The top 30 most frequently used words longer than 5 characters are:")
for freq, word in freq_tuples[:30]:
    print(freq, word)
    
abstract_word_lengths = [len(a.split()) for a in abstracts]
print("Average abstract length in words:", sum(abstract_word_lengths)/len(abstract_word_lengths))


The top 30 most frequently used words are:
242 the
188 of
156 and
136 a
110 to
102 in
100 we
78 for
72 on
52 this
49 that
44 is
36 as
35 with
27 which
26 our
25 are
24 translation
22 semantic
22 more
21 post-editing
20 readability
19 dutch
18 using
18 from
18 features
17 system
17 sentiment
17 by
17 both
The top 30 most frequently used words longer than 5 characters are:
24 translation
22 semantic
21 post-editing
20 readability
18 features
17 system
17 sentiment
16 information
16 different
15 quality
15 present
15 approach
14 results
14 machine
14 language
13 analysis
12 aspect
11 research
11 performance
11 learning
11 fine-grained
11 errors
11 detection
11 corpus
11 coreference
10 standard
10 feature
9 systems
9 experiments
9 cyberbullying
Average abstract length in words: 157.83333333333334
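
The frequency dictionary above is built by hand, which is good practice. The standard library also offers `collections.Counter`, which does the counting and sorting for us. Here is a minimal sketch on two made-up abstracts (`sample_abstracts` stands in for the scraped list). Note that words with equal counts may come out in a different order than in the code above.

```python
from collections import Counter

# Two made-up abstracts standing in for the scraped list
sample_abstracts = ["We present a translation system.",
                    "The translation quality of the system is high."]

counts = Counter()
for abstract in sample_abstracts:
    counts.update(abstract.lower().split())  # add the words of one abstract

for word, freq in counts.most_common(3):  # the 3 most frequent words
    print(freq, word)
```

Like the original code, this splits on whitespace only, so `system.` and `system` are counted as different words.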


---
## Working with libraries

In the previous scraping example we have worked with the requests and BeautifulSoup4 software libraries. Libraries are a staple of a programmer's daily work as they provide ready-made functionality that is commonly needed.

Because libraries are usually written by someone else, it is important to be aware of the exact behaviour and limitations of their functions. This is why good libraries have **documentation** which describes their functionality in detail.
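
Part of that documentation usually travels with the code itself as so-called docstrings, which is what `help()` displays. As a small sketch, here is how to retrieve the docstring of the `str.startswith` method we used earlier. The exact wording of the text depends on your Python version.

```python
import inspect

# The same text that help(str.startswith) would show interactively
doc = inspect.getdoc(str.startswith)
print(doc)
```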

#### DIY 3
Go to the BeautifulSoup4 documentation website: https://beautiful-soup-4.readthedocs.io/.

Search for where the .find_all() function is described.

Adapt the code below by *adding arguments to soup.find_all() so that only 3 elements are returned*

In [19]:
r = requests.get('https://www.lt3.ugent.be/publications')
html = r.text 
soup = bs4.BeautifulSoup(html, "lxml")
for link in soup.find_all("a"): # add a function argument here so only 3 elements are returned
    print(link["href"])

/
#
#
https://ugent.be
/admin/login/
/news/
/projects/
/publications/
/people/
/tools/
/courses/
/contact/
/
/publications/how-do-students-cope-with-machine-translation-outp/
/publications/dutch-compound-splitting-for-bilingual-terminology/
/publications/scate-taxonomy-and-corpus-of-machine-translation-e/
/publications/translation-methods-and-experience-a-comparative-a/
/publications/translationese-and-post-editese-how-comparable-is-/
/publications/an-automatic-part-of-speech-tagger-for-middle-low-/
/publications/cartoons-as-interdiscourse-a-quali-quantitative-an/
/publications/a-neural-network-architecture-for-detecting-gramma/
/publications/semeval-2016-task-13-taxonomy-extraction-evaluatio/
/publications/the-effectiveness-of-consulting-external-resources/
/publications/a-translation-robot-for-each-translator-a-comparat/
/publications/all-mixed-up-finding-the-optimal-feature-set-for-2/
/publications/computers-en-plagiaatdetectie/
/publications/rude-waiter-but-mouthwatering-pastries-a

## Working with objects
Python is an object-oriented programming language.
- brief explanation of OO
- what are objects
- working with objects
- do not go into much detail
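
As a minimal illustration of these points: an object bundles data (attributes) with functions that work on that data (methods). We have already used this when writing `r.text` (an attribute of a Response object) and `soup.find_all("a")` (a method of a BeautifulSoup object). The class `Abstract` below is purely hypothetical and only serves to show the mechanics.

```python
class Abstract:
    """A hypothetical class bundling an abstract's text with behaviour."""

    def __init__(self, text):
        self.text = text  # attribute: data stored on the object

    def word_count(self):  # method: a function that belongs to the object
        return len(self.text.split())


a = Abstract("We present a translation system")
print(type(a).__name__)  # the name of the class this object was made from
print(a.text)            # access an attribute with dot notation
print(a.word_count())    # call a method with dot notation and parentheses
```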

## Final exercises
Print all the table tags from the following page: https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India

In [6]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India')
html = r.text
soup = bs4.BeautifulSoup(html, "lxml")

# your code here
for table in soup.find_all("table"):
    print(table)

<table class="vertical-navbox nowraplinks" style="float:right;clear:right;width:22.0em;margin:0 0 1.0em 1.0em;background:#f9f9f9;border:1px solid #aaa;padding:0.2em;border-spacing:0.4em 0;text-align:center;line-height:1.4em;font-size:88%">
<tr>
<th style="padding:0.2em 0.4em 0.2em;font-size:145%;line-height:1.2em"><a href="/wiki/States_and_union_territories_of_India" title="States and union territories of India">States and union<br/>
territories of India</a><br/>
ordered by</th>
</tr>
<tr>
<td style="padding:0.2em 0 0.4em">
<div class="center">
<div class="floatnone"><a class="image" href="/wiki/File:Flag_of_India.svg"><img alt="Flag of India.svg" data-file-height="900" data-file-width="1350" height="47" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/70px-Flag_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/105px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/140px-Flag_of_In
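
Printing whole tables gives us raw HTML again. To get closer to structured data, we can walk over the `<tr>`, `<th>` and `<td>` tags introduced earlier. The sketch below does this for a small made-up table, not the actual Wikipedia markup, and uses the built-in `"html.parser"`. Real tables often need extra care, for instance for cells that span several rows or columns.

```python
import bs4

html = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Goa</td><td>Panaji</td></tr>
  <tr><td>Kerala</td><td>Thiruvananthapuram</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):  # every row of the first table
    # header cells (th) and data cells (td) are collected alike
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

Each inner list corresponds to one table row, which is a convenient shape for writing to a spreadsheet or database later on.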

------------------------------

You've reached the end of Chapter 8! You can safely ignore the code below, it's only there to make the page pretty:

In [None]:
from IPython.display import HTML
def css_styling():
    # Read the stylesheet and close the file again when done
    with open("styles/custom.css", "r") as f:
        styles = f.read()
    return HTML(styles)
css_styling()