# Data Crawling

## Andreas Stöckl

### University of Applied Sciences Upper Austria


Data that is not made available in structured form, for example via APIs, can be obtained via so-called "data crawling" or "web scraping". This involves software retrieving and storing the web pages as they are made available for display in browsers. As a result of such a page retrieval, we receive the HTML source text of the page as a plain text file.

However, data obtained in this way usually comes mixed with markup that is only needed for display or navigation. The result is often a large amount of HTML code that has to be processed before the required data can be used.

Searching for and extracting the desired text passages using text analysis functions, such as regular expressions, is usually a complex and time-consuming task that is also prone to errors.
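To illustrate why this approach is error-prone, here is a minimal sketch with two made-up HTML snippets (the class name `review` and the texts are assumptions for illustration only). A pattern that works for the first snippet silently misses the second, although a browser would render both the same way.

```python
import re

# Two hypothetical snippets that a browser would render identically.
html_a = '<div class="review"><p>Great stay</p></div>'
html_b = '<div id="r2" class="review"><p>Great stay</p></div>'

# A naive pattern that expects "class" to be the only attribute.
pattern = re.compile(r'<div class="review"><p>(.*?)</p></div>')

matches_a = pattern.findall(html_a)
matches_b = pattern.findall(html_b)

print(matches_a)  # the text is found
print(matches_b)  # empty list: the extra attribute breaks the pattern
```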

Specialised packages such as "BeautifulSoup" provide a remedy here.

### BeautifulSoup 

Beautiful Soup is a Python library for reading data from HTML and XML files. It provides ways to navigate, search and modify the parse tree, and can save hours or days of work.

It provides the following functions, for example:

- Find the first occurrence of a specific HTML element (for example, by its CSS class name) and extract its content or attributes.
- Find all occurrences of HTML elements.
- Navigate in the HTML tree along its edges (e.g. find all child nodes of an HTML element).
- Modify the HTML tree, for example by inserting elements.
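
The four operations listed above can be sketched as follows. The HTML snippet, the class name `hotel` and the attribute `data-stars` are made up for illustration; only the Beautiful Soup calls (`find`, `find_all`, `new_tag`, `append`, `get_text`) come from the library.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = """
<ul id="hotels">
  <li class="hotel" data-stars="5">Hotel A</li>
  <li class="hotel" data-stars="4">Hotel B</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. First element with a given CSS class: read its content and an attribute.
first = soup.find("li", class_="hotel")
first_name = first.get_text()
first_stars = first["data-stars"]

# 2. All occurrences of an element.
names = [li.get_text() for li in soup.find_all("li")]

# 3. Navigate the tree: direct child tags of the <ul> element.
child_tags = [child.name for child in soup.find("ul").find_all(recursive=False)]

# 4. Modify the tree: append a new <li> element to the list.
new_li = soup.new_tag("li")
new_li.string = "Hotel C"
soup.find("ul").append(new_li)
names_after = [li.get_text() for li in soup.find_all("li")]

print(first_name, first_stars)
print(names)
print(child_tags)
print(names_after)
```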

The documentation of "BeautifulSoup" can be found at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

There are a variety of other tools, such as:

- Scrapy - https://scrapy.org/
- Selenium - https://www.selenium.dev/

### Scrapy 

Scrapy not only lets you crawl HTML pages and extract parts of them. As a framework, it also lets you write complete web crawling applications that read pages, follow the links they contain, and make the extracted content available.

### Selenium

Selenium is web browser automation software that is often used for automated testing, but it can also be useful for web scraping. It is typically used when retrieving and evaluating pages by means of plain requests is not sufficient. If actions in the browser, such as mouse clicks, are necessary to access the information, then controlling the browser with a tool such as Selenium is useful.

## Case study

As a hotel owner, you want to keep track of the reviews that guests write about your hotel and about those of your competitors. To do this, you want to collect the texts of these reviews. The reviews are spread across various hotel booking and hotel review portals, which do not provide APIs to retrieve this information. This makes it a typical use case for web scraping.

In the following I would like to show how to download and collect the texts of hotel reviews from tripadvisor.com. Let's take a look at an example hotel:

![Reviews](tripadvisor.png)

The reviews can be found at the address:
https://www.tripadvisor.com/Hotel_Review-g651973-d6978275-Reviews-Ikos_Olivia-Gerakini_Halkidiki_Region_Central_Macedonia.html

As a first step, we install BeautifulSoup (package "bs4") using pip.

In [3]:
!pip install bs4



Then we import the module, together with a package for retrieving web pages.

In [4]:
import urllib.request, urllib.error
from bs4 import BeautifulSoup

We build a so-called "request object" for the page with the reviews and use it to try to read the page. Either the content of the page is delivered and can be displayed, or a suitable error message is output. Note that the code uses the Austrian domain tripadvisor.at, which is why the retrieved page is in German.

More about this at: https://docs.python.org/3/howto/urllib2.html

In [5]:
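url = "https://www.tripadvisor.at/Hotel_Review-g651973-d6978275-Reviews-Ikos_Olivia-Gerakini_Halkidiki_Region_Central_Macedonia.html#REVIEWS"
# Build the request object; None means no POST data, so a GET request is sent.
req = urllib.request.Request(url, None)

try:
    # Open the URL and read the response body as bytes.
    html = urllib.request.urlopen(req).read()
    print(html)
except urllib.error.URLError as e:
    # Raised when the server cannot be reached or answers with an HTTP error.
    print(e.reason)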
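
The request step can also be wrapped in a small reusable helper. The following is a minimal sketch under stated assumptions: the function name `fetch_html`, the `timeout` default and the User-Agent string are illustrative choices, not something a particular site requires. It distinguishes HTTP errors from connection errors and decodes the bytes into text.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails.

    `fetch_html` and the User-Agent string are illustrative assumptions.
    """
    # Some servers respond differently to clients without a browser-like header.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            raw = response.read()  # bytes
            # Use the charset announced in the headers, else assume UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return raw.decode(charset, errors="replace")
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print("HTTP error:", e.code, e.reason)
    except urllib.error.URLError as e:
        # The request could not be completed at all (e.g. DNS failure).
        print("Connection error:", e.reason)
    return None
```

Returning `None` on failure lets the calling code decide whether to skip the page or retry it later.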



The result is a long, hard-to-read byte string full of special characters, HTML tags and scripts. It is difficult to find the reviews in it, and it is also time-consuming to extract them.
This is where BeautifulSoup comes in: it turns the raw text into a structured data object that is easy to search.

In [6]:
soup = BeautifulSoup(html, 'html.parser')
print(soup)

<!DOCTYPE html>
<html lang="de-AT" xmlns:og="http://opengraphprotocol.org/schema/"><head><meta content="text/html; charset=utf-8" http-equiv="content-type"/><link href="https://static.tacdn.com/favicon.ico?v2" id="favicon" rel="icon" type="image/x-icon"/><link color="#000000" href="https://static.tacdn.com/img2/brand_refresh/application_icons/mask-icon.svg" rel="mask-icon" sizes="any"/><meta content="#34e0a1" name="theme-color"/><meta content="telephone=no" name="format-detection"/><script type="text/javascript">window.taRollupsAreAsync = true;</script><link crossorigin="" href="https://static.tacdn.com/css2/webfonts/TripSans/TripSans.css?v1.002" rel="stylesheet"/><title>IKOS OLIVIA: Bewertungen, Fotos &amp; Preisvergleich (Gerakini, Griechenland) - Tripadvisor</title><meta content="TripAdvisor" property="al:ios:app_name"/><meta content="284876795" property="al:ios:app_store_id"/><meta content="284876795" name="twitter:app:id:ipad" property="twitter:app:id:ipad"/><meta content="2848767

The output of the object doesn't look much different yet, but internally everything is ready to work with it.
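
What "ready to work with" means can be seen on a small scale. The sketch below uses a made-up miniature page (title and text are assumptions) instead of the downloaded one, and shows that elements of the parsed object can be reached directly by name.

```python
from bs4 import BeautifulSoup

# Made-up miniature page standing in for the downloaded HTML.
sample_html = (
    "<html><head><title>Example Hotel - Reviews</title></head>"
    "<body><div class='intro'><p>Welcome</p></div></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Elements can be reached by name as attributes of the parsed object.
page_title = sample_soup.title.string
first_paragraph = sample_soup.body.div.p.get_text()

print(page_title)
print(first_paragraph)

# prettify() returns an indented version of the tree, which is easier to read
# than the raw one-line output when looking for the relevant tags.
print(sample_soup.prettify())
```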

We look in the source code of the page for the HTML tag and the CSS class name that enclose the text of the reviews. In this example, this is a div tag with the class "cPQsENeY". This manual search for the correct tags can be tedious on large pages such as this one. It should also be remembered that such a query is fragile: a structural change to the page, such as a change to the page template, will cause the script to stop working. Unfortunately, this has to be accepted in web scraping.
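
One way to notice such breakage early is to fail loudly when nothing matches. The following is a minimal sketch: `extract_blocks` is a hypothetical helper (not part of Beautiful Soup), the two HTML snippets are made up, and `find_all` is the library method that is applied to the real page in the next step.

```python
from bs4 import BeautifulSoup


def extract_blocks(html, css_class):
    """Return all <div> elements with `css_class`; raise if none exist.

    `extract_blocks` is a hypothetical helper, not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")
    blocks = soup.find_all("div", class_=css_class)
    if not blocks:
        # An empty result often means the page template has changed.
        raise ValueError(f"No <div> with class {css_class!r} found")
    return blocks


# Made-up HTML; "cPQsENeY" is the class name used in this example.
old_layout = '<div class="cPQsENeY"><p>Nice hotel</p></div>'
new_layout = '<div class="renamed"><p>Nice hotel</p></div>'

print(len(extract_blocks(old_layout, "cPQsENeY")))
try:
    extract_blocks(new_layout, "cPQsENeY")
except ValueError as err:
    print("Layout changed?", err)
```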

Using the "find_all" method of the "soup" object, we can now get a list of all these elements.

In [7]:
reviews = soup.find_all("div", class_="cPQsENeY")
print(reviews)

[<div class="cPQsENeY" style="max-height:242px;line-break:normal;cursor:auto"><div><p>Das Ikos Olivia ist eine ausgezeichnete Wahl, wenn Sie Gerakini besuchen möchten. Die Unterkunft bietet ein familienfreundliches Umfeld mit vielen Annehmlichkeiten für Reisende und überzeugt außerdem durch die ideale Kombination aus Preis-Leistung, Komfort und Bequemlichkeit.</p><p>Zimmer im Ikos Olivia bieten Flachbildfernseher, Klimaanlage und Minibar und Gäste können mit dem kostenlosen WLAN in Kontakt bleiben.</p><p>Darüber hinaus können die Gäste während ihres Aufenthalts im Ikos Olivia den Concierge und den Zimmerservice in Anspruch nehmen.  Während Ihres Aufenthalts im Ikos Olivia können Sie gerne den Pool und das Frühstück besuchen. Benötigen Sie einen Parkplatz? Am Ikos Olivia ist ein kostenloser Parkplatz verfügbar.</p><p>Wenn Sie nach einem griechischen Restaurant suchen, probieren Sie doch Anemomilos Restaurant, Elia oder 4 Epohes, die alle nicht weit vom Ikos Olivia entfernt sind.</p><p>K

The individual elements of the list each contain a review text for the hotel, but they also still include HTML tags. To get the plain text, we apply the "get_text" method to each element and print the result.

In [8]:
for i in reviews:
    print(i.get_text())

Das Ikos Olivia ist eine ausgezeichnete Wahl, wenn Sie Gerakini besuchen möchten. Die Unterkunft bietet ein familienfreundliches Umfeld mit vielen Annehmlichkeiten für Reisende und überzeugt außerdem durch die ideale Kombination aus Preis-Leistung, Komfort und Bequemlichkeit.Zimmer im Ikos Olivia bieten Flachbildfernseher, Klimaanlage und Minibar und Gäste können mit dem kostenlosen WLAN in Kontakt bleiben.Darüber hinaus können die Gäste während ihres Aufenthalts im Ikos Olivia den Concierge und den Zimmerservice in Anspruch nehmen.  Während Ihres Aufenthalts im Ikos Olivia können Sie gerne den Pool und das Frühstück besuchen. Benötigen Sie einen Parkplatz? Am Ikos Olivia ist ein kostenloser Parkplatz verfügbar.Wenn Sie nach einem griechischen Restaurant suchen, probieren Sie doch Anemomilos Restaurant, Elia oder 4 Epohes, die alle nicht weit vom Ikos Olivia entfernt sind.Komfort und Zufriedenheit der Gäste stehen im Ikos Olivia an erster Stelle und die Unterkunft freut sich, Sie in Ge
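
In the output above, the paragraphs run together without a space (for example "bleiben.Darüber"), because "get_text" joins the text pieces with an empty string by default. A minimal sketch of how the `separator` and `strip` parameters of "get_text" can help, using a made-up snippet with the same nesting:

```python
from bs4 import BeautifulSoup

# Made-up HTML with the same nesting as the output above.
snippet = (
    '<div class="cPQsENeY"><div>'
    "<p>First sentence.</p><p>Second sentence.</p>"
    "</div></div>"
)
block = BeautifulSoup(snippet, "html.parser").find("div", class_="cPQsENeY")

plain = block.get_text()
spaced = block.get_text(separator=" ", strip=True)

print(plain)   # paragraphs run together
print(spaced)  # paragraphs separated by a space
```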

The review texts for the hotel are now available and can, for example, be stored in a database and used for analyses. A sentiment analysis is particularly useful here to determine the tone of the texts.
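
Storing the texts could look like the following minimal sketch, which uses Python's built-in `sqlite3` module. The table layout, the column names and the placeholder texts are assumptions for illustration; any other database would work as well.

```python
import sqlite3

# Placeholder texts standing in for the scraped review texts.
review_texts = ["Great hotel, friendly staff.", "Nice pool, but the room was small."]

# ":memory:" keeps the database in RAM; pass a file name to persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS reviews ("
    "id INTEGER PRIMARY KEY, hotel TEXT NOT NULL, review_text TEXT NOT NULL)"
)

# Parameterised queries avoid quoting problems in the review texts.
conn.executemany(
    "INSERT INTO reviews (hotel, review_text) VALUES (?, ?)",
    [("Ikos Olivia", text) for text in review_texts],
)
conn.commit()

stored = conn.execute(
    "SELECT hotel, review_text FROM reviews ORDER BY id"
).fetchall()
print(stored)
conn.close()
```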

If you want to read several pages, you retrieve them one after the other and extract the results from each. With many pages this quickly becomes unwieldy, and sometimes you want to read all pages that follow a certain URL structure. In practice, this is where other tools such as "Scrapy" come into play.
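
For a handful of pages, the manual approach can still be sketched in a few lines. In the following example everything apart from the Beautiful Soup calls is an assumption: `scrape_pages` is a hypothetical helper, the `example.com` URLs and the class name `review` are made up, and the pages are served from a dictionary instead of the network so that the sketch runs offline.

```python
import time

from bs4 import BeautifulSoup


def scrape_pages(urls, fetch, css_class, delay=1.0):
    """Fetch each URL in turn and collect the text of all matching <div>s.

    `fetch` is any function that maps a URL to an HTML string (or None on
    failure); `scrape_pages` itself is a hypothetical helper.
    """
    texts = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests to spare the server
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be retrieved
        soup = BeautifulSoup(html, "html.parser")
        for block in soup.find_all("div", class_=css_class):
            texts.append(block.get_text(separator=" ", strip=True))
    return texts


# Demonstration with fake pages instead of real network requests.
fake_pages = {
    "https://example.com/reviews?page=1": '<div class="review">Good</div>',
    "https://example.com/reviews?page=2": '<div class="review">Bad</div>',
}
urls = [f"https://example.com/reviews?page={n}" for n in (1, 2, 3)]
result = scrape_pages(urls, fake_pages.get, "review", delay=0)
print(result)
```

Passing the download function in as `fetch` keeps the loop independent of how pages are retrieved, so the same code can be tried out offline and later pointed at real requests.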