<a href="https://colab.research.google.com/github/MCanela-1954/DataSci_Course/blob/main/%5BDATA-03%5D%20Web%20scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [DATA-03] Web scraping

## What is HTML?

**HTML** (Hypertext Markup Language) is the language in which documents designed to be displayed in a web browser are written. The web browser receives an HTML document from a web server or from local storage and renders it as a multimedia web page. That HTML document is then called the **page source**.

HTML is assisted by two technologies:

* **CSS** (Cascading Style Sheets) is a language used to describe the **style** of HTML documents.

* **JavaScript** is a **scripting language**, that is, a language whose programs (scripts) are run by another program, in this case the browser. The source of a dynamic web page typically contains JavaScript scripts to perform actions such as accepting cookies or asking for more information.

## Example

An extremely simple example of an HTML document follows. It is easy to see, in this example, why HTML is called a **markup language**. The markup, consisting here of the **tags** `<head>`, `<body>`, `<title>`, `<div>` and `<a>`, is used to create a structure in the document and to include **links** to other files.

```
<html>
<head>
	<title>Data Viz</title>
</head>
<body>
	<div class="course">Data Visualization</div>
	<div class="program">MBA full-time</div>
	<a class="professor" href="faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
</body>
</html>
```

Unfortunately, an HTML document captured from the Internet rarely comes in such a friendly presentation, with one line for each tag and indentation to help you see the structure of the document. But you can find on the Internet various tools for rendering HTML documents in this form.

## Tags and attributes

The structure of an HTML document is made by the tags. Every part of the document is opened by a **start tag** (`<tag>`) and closed by an **end tag** (`</tag>`). These parts are called **HTML elements**. The tags create a tree-like structure in the document, with HTML elements nested within HTML elements. The representation of the HTML document as a logical tree is called the **Document Object Model** (DOM).

*Note*. Though we may insert white space between consecutive elements to make the document readable, as in the example above, the white space between tags that belong to different elements is ignored by the HTML interpreter.

The tag `<html>` tells the browser that this is an HTML document. The `html` element is the whole document. It has two **child elements**, `head` and `body`. An HTML document is always split in this way. In the example, the `head` element has one **child**, while the `body` element has three children, which are **siblings**.

Then, the `title` element contains the string `'Data Viz'`, enclosed between the start tag and the end tag (since `title` is nested within `head`, the string is also contained in the `head` element). This string is referred to as **text**. Also, most of the start tags have **attributes**. In our example, the `div` elements have one `class` attribute, while the `a` element has two attributes, a `class` attribute and an `href` attribute. `class` attributes, which specify one or more class names for some elements of the HTML document, are very frequent. The value of a `class` attribute can be used by CSS to provide style and by JavaScript to perform certain tasks for the elements with that `class` value.

The `a` tags have a special role, to mark hyperlinks. A **hyperlink** is used to link a page to another page, or to download a file. The most important attribute of an `a` element is the `href` attribute, which indicates the link's destination.

There are other HTML tag names, not used in the above example. In this course, they will be explained as they appear in other examples.
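
The tree structure and the attributes described in this section can be made visible with a few lines of Python. The following is only a minimal sketch, based on the `html.parser` module of the Python Standard Library. The class name `TreePrinter` and the variable `sample_html` are illustrative choices, not part of any package. Every start tag is listed with an indentation proportional to its depth in the tree, followed by its attributes.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collects one line per start tag, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, turned here into a dictionary
        self.lines.append('  ' * self.depth + tag + ' ' + str(dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag takes us one level up in the tree
        self.depth -= 1

sample_html = (
    '<html><head><title>Data Viz</title></head>'
    '<body><div class="course">Data Visualization</div>'
    '<div class="program">MBA full-time</div>'
    '<a class="professor" href="faculty-research/faculty/miguel-angel-canela">'
    'Miguel Ángel Canela</a></body></html>'
)

printer = TreePrinter()
printer.feed(sample_html)
print('\n'.join(printer.lines))
```

In the output, `head` and `body` appear one level below `html`, and the two `div` elements and the `a` element appear one level below `body`, each with its attributes.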

## What is Beautiful Soup?

**Beautiful Soup** is a Python package for extracting data from HTML files. Other packages, like **scrapy**, provide more advanced toolkits, but, since Beautiful Soup is much friendlier, most **web scraping** practitioners start there.

Beautiful Soup is available in Colab notebooks, but you may have to install it if you are working on your own computer. This can be done by running in the shell (or in a Jupyter app) the command `pip install beautifulsoup4` (the package is then imported under the name `bs4`). Once the package is installed, the recommended import style is:

In [None]:
from bs4 import BeautifulSoup

This allows us to use the function `BeautifulSoup()`, which can be applied to any string containing HTML code. `BeautifulSoup()` **parses** the HTML code, learning the tree structure encoded there, which is then stored in a **soup object**. Let us see how this works in our example.

In web scraping projects, HTML documents are captured from the Internet, as shown later in this note. For this brief tutorial, we create a string variable whose value is the HTML document. The triple quotes allow the string to span several lines.

In [None]:
html_str = '''
<html>
  <head>
    <title>Data Viz</title>
  </head>
  <body>
    <div class="course">Data Visualization</div>
    <div class="program">MBA full-time</div>
    <a class="professor" href="faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
  </body>
</html>
'''

## Parsing HTML code

To parse the string `html_str`, learning the tree structure, we enter:

In [None]:
soup = BeautifulSoup(html_str, 'html.parser')

`BeautifulSoup()` returns a `BeautifulSoup` object, which is a data structure representing a parsed HTML document. This data structure stores the contents of the string `html_str` in such a way that the different HTML elements can be extracted. To do this, it uses a **parser**, which is a program that splits the string into substrings, based on the tags.

Beautiful Soup does not come with its own parser. If no parser is specified, it chooses among those available to the current Python kernel, following internal rules. The actual choice depends on the packages already installed on your computer (or by your cloud computing provider). If `html.parser` is specified, the choice is the parser provided by the Python Standard Library. Since this is a rather technical issue, we follow here the recommended practice.

In [None]:
type(soup)

The contents of `soup` can be displayed (don't do this for the source of a real web page, which will be too big to be read on the screen):

In [None]:
soup

The same is true for the elements contained in `soup`, as we will see below. But we have to see first how to extract these elements from the soup.
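
Before moving on, note that a soup can also be displayed in the friendly form mentioned earlier, with one tag per line and indentation. A minimal sketch, using the method `.prettify()` of Beautiful Soup on a shortened version of the example (the variable name `pretty` is just an illustrative choice):

```python
from bs4 import BeautifulSoup

html_str = (
    '<html><head><title>Data Viz</title></head>'
    '<body><div class="course">Data Visualization</div></body></html>'
)
soup = BeautifulSoup(html_str, 'html.parser')

# .prettify() returns a string with one tag per line, indented by nesting depth
pretty = soup.prettify()
print(pretty)
```

Since `.prettify()` returns an ordinary string, the result can be printed, as here, or written to a text file for inspection.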


## The tree structure

Once you have a soup, you can easily explore its content. For instance:

In [None]:
soup.head

In [None]:
type(soup.head)

Though, formally, `BeautifulSoup` and `Tag` are different types, a tag works in practice as a smaller soup, so you can extract elements within elements:

In [None]:
soup.head.title

If you ask for a nonexistent element, you get `None`:

In [None]:
soup.head.div

When there are several elements satisfying the requirements, you get the first one. This is the logic of Beautiful Soup:

In [None]:
soup.div
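
The vocabulary of children, siblings and parents introduced above has a direct counterpart in Beautiful Soup. The following sketch, which rebuilds the soup so that it can be run on its own, lists the children of `body`, moves from the first `div` to its sibling, and goes up from `title` to its parent. The variable names `child_names` and `second_div` are illustrative.

```python
from bs4 import BeautifulSoup

html_str = '''
<html>
  <head>
    <title>Data Viz</title>
  </head>
  <body>
    <div class="course">Data Visualization</div>
    <div class="program">MBA full-time</div>
    <a class="professor" href="faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
  </body>
</html>
'''
soup = BeautifulSoup(html_str, 'html.parser')

# Direct children of body that are tags (True matches any tag name,
# recursive=False stops the search at the first level)
child_names = [child.name for child in soup.body.find_all(True, recursive=False)]
print(child_names)

# The next sibling of the first div that is also a div
second_div = soup.div.find_next_sibling('div')
print(second_div.string)

# Going up: the parent of the title element
print(soup.title.parent.name)
```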

The simplest way to extract tags from the soup is based on the methods `.find()` and `.find_all()`:

* `.find_all()` returns a list containing all the HTML elements that satisfy a specification (the list may be empty).

* `.find()` returns only the first one (or `None`, if there is no element satisfying the specification).

Let us see how to use these methods in our example.

## The method .find()

A first example of `.find()` follows.

In [None]:
soup.find('div')

Note that there are two `div` elements in this soup, and `.find()` has extracted the first one. But we can use the attribute values to distinguish among elements with the same tag name:

In [None]:
soup.find('div', attrs={'class': 'program'})

For the attribute `class`, this can be shortened as:

In [None]:
soup.find('div', 'program')
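
Beautiful Soup also accepts the keyword argument `class_` (with a trailing underscore, because `class` is a reserved word in Python). The sketch below, on a shortened version of the example, checks that the three ways of writing the search lead to the same element. The names `by_attrs`, `by_position` and `by_keyword` are illustrative.

```python
from bs4 import BeautifulSoup

html_str = '''
<body>
  <div class="course">Data Visualization</div>
  <div class="program">MBA full-time</div>
</body>
'''
soup = BeautifulSoup(html_str, 'html.parser')

by_attrs = soup.find('div', attrs={'class': 'program'})   # explicit dictionary
by_position = soup.find('div', 'program')                  # second positional argument
by_keyword = soup.find('div', class_='program')            # keyword argument

print(by_attrs == by_position == by_keyword)

# The class attribute can hold several names, so its value comes as a list
print(by_keyword['class'])
```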

Since a tag works as a smaller soup, you can chain calls to the method `.find()`:

In [None]:
soup.find('head').find('title')

## The method .find_all()

The method `.find_all()` uses the same syntax as `.find()` but, instead of a single element, it returns a list with all the elements that satisfy the specification:

In [None]:
soup.find_all('div')

Note that `.find_all()` *always* returns a list. The list can be empty (`.find()` would return `None` in that case).

In [None]:
soup.find('head').find_all('div')
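
The difference between `None` and an empty list matters in practice, because chaining on `None` raises an error, while looping over an empty list simply does nothing. A minimal sketch, in which the `table` tag has been chosen only because it does not occur in the example:

```python
from bs4 import BeautifulSoup

html_str = '<html><head><title>Data Viz</title></head><body><div>Data Visualization</div></body></html>'
soup = BeautifulSoup(html_str, 'html.parser')

no_table = soup.find('table')        # None, since there is no table element
no_tables = soup.find_all('table')   # empty list

# Guard against None before asking for the text of the element
table_text = no_table.string if no_table is not None else None

# Looping over an empty list is safe: the body of the loop is never run
texts = [t.string for t in no_tables]

print(no_table, len(no_tables), table_text, texts)
```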

When there is only one element in the list, that element is precisely the one returned by `.find()`:


In [None]:
soup.find_all('div', 'course')

Note that, even when there is exactly one element in the list returned by `.find_all()`, you have to extract it from the list:

In [None]:
soup.find_all('div', 'course')[0]

## Extracting information from an HTML element

The information that we wish to extract from an HTML element can come as the text enclosed by the start tag and the end tag, or as the value of an attribute. With `.string`, we can extract the text enclosed by the tags:

In [None]:
soup.find('a').string

Note that `.string` cannot be applied directly to a list returned by `.find_all()`. The following raises an `AttributeError`:


In [None]:
soup.find_all('div').string

But we can use a **list comprehension** to extract the text from every item of the list and store it in a new list:

In [None]:
[t.string for t in soup.find_all('div')]

In certain cases, we are interested in the value of an attribute. A frequent example is that of an `a` element with an `href` attribute whose value is a relevant link. The link is then extracted as:


In [None]:
soup.find('a')['href']
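
Text and attribute values are typically collected together in a record. The sketch below is one possible way to do it, under two assumptions that are not part of the example: the base URL `https://www.example.com/` is hypothetical, used only to show how a relative link can be completed with `urljoin()` from the Python Standard Library, and the name `record` is illustrative. The method `.get_text()` is used instead of `.string` because it also works when the element contains nested tags.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html_str = '''
<body>
  <div class="course">Data Visualization</div>
  <div class="program">MBA full-time</div>
  <a class="professor" href="faculty-research/faculty/miguel-angel-canela">Miguel Ángel Canela</a>
</body>
'''
soup = BeautifulSoup(html_str, 'html.parser')

base_url = 'https://www.example.com/'   # hypothetical, not the real site

link = soup.find('a', 'professor')
record = {
    'course': soup.find('div', 'course').get_text(strip=True),
    'program': soup.find('div', 'program').get_text(strip=True),
    'professor': link.get_text(strip=True),
    # .get() returns None, instead of raising a KeyError, if the attribute is missing
    'link': urljoin(base_url, link.get('href')),
}
print(record)
```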

## What is web scraping?

**Web scraping** is concerned with extracting data from websites, in particular data that would be difficult to get on a large scale using traditional data collection methods. There is a whole industry built around web scraping, as it is used to track product price changes or discounts, to gather data from social profiles, to capture real estate listings, in search engine optimization (SEO), etc.

Scraping a web page involves downloading the page and extracting data from it. Both things can be done in many ways, in particular with Python tools. There are also commercial web scraping applications, such as **Apify** and **Octoparse**. This course uses the Python packages **Requests** and **Beautiful Soup**.

## HTML and the browser

Suppose that your browser (let us assume that you use Google Chrome) is displaying a web page on the screen. Right-clicking anywhere on the page opens a contextual menu. If you then select the *View Page Source* option, a new tab opens, displaying an HTML document. In the simplest case, which is the one covered in this lecture, this HTML document corresponds to the page that the browser was displaying.

But not all pages are that simple. Some use a technology called **AJAX** (Asynchronous JavaScript And XML), in a process that works as follows:

1. The page corresponding to the URL that you enter is loaded.

2. A JavaScript program creates an `XMLHttpRequest` object.

3. The `XMLHttpRequest` object sends a request to a web server.

4. The server sends a new HTML document back to the browser, which the browser displays on the screen. This second document corresponds to the page that you are actually viewing.

The tools provided by the Python package **Requests** can only capture the first page, which is not always the one from which you wish to scrape the information. To get the second one, web scrapers use a tool called **Selenium**, not covered in this course.

Also in the contextual menu of the browser, the option *Inspect* can help you find the HTML code chunk corresponding to a specific part of a web page. If you right-click on the part of the page in which you are interested and select *Inspect*, the screen is split, leaving the web page on one side and displaying on the other side the *Developer Tools* panel, which provides many choices: *Elements*, *Console*, *Network*, etc. The first one contains the page's DOM tree and gives you full access to the source code of the page currently displayed, which may be different from the one you requested, as explained above. The element of the page on which you have clicked appears highlighted.

## The package Requests

In Python, files can be downloaded from Internet sources in multiple ways. Some sources still recommend the package `urllib`, which is part of the Python Standard Library. But, nowadays, Requests, available in Colab notebooks and the Anaconda distribution, is the favorite choice of practitioners.

Let us refresh the context. Through the browser, you can access a resource by specifying a **Uniform Resource Locator** (URL). At the beginning of the URL, we find the protocol used to access the resource, followed by a colon and two forward slashes. This is usually HTTPS, a secure version of HTTP.

The **Hypertext Transfer Protocol** (HTTP) was designed to enable communications between clients and servers. For instance, a client (such as your browser) sends an **HTTP request** to the server. Then, the server returns the response to the client. The response contains status information about the request and, if the request is accepted, the requested content.

**GET** is one of the most common HTTP methods. It is used to request data from a specified resource. The Requests function `get()` sends a GET request from Python. You can use it as follows:

```
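import requests
url = 'https://example.com'          # placeholder: replace with the URL of the page to scrape
html_str = requests.get(url).text    # the page source, as a string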
```

`get()` returns a response object (type `requests.models.Response`), containing data about the request and the server's response. The attribute `text` of this object is a string which, for an ordinary web page, is the HTML source code. Then, you can extract the information sought as explained in this note. That information, after some cleaning, can be exported to your preferred data format.
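
The two steps, downloading and extracting, can be put together as in the following sketch. The function names `fetch_html` and `extract_links` are hypothetical helpers written for this illustration, and the URL in the final comment is a placeholder. The status information mentioned above is checked with `raise_for_status()`, which raises an exception when the server reports an error. The extraction step is tried here on a small string, so that it can be run without a network connection.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Download a page and return its source as a string."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the status code signals an error (4xx or 5xx)
    response.raise_for_status()
    return response.text

def extract_links(html_str):
    """Return (text, href) pairs for every a element that has an href attribute."""
    soup = BeautifulSoup(html_str, 'html.parser')
    return [(a.get_text(strip=True), a['href']) for a in soup.find_all('a', href=True)]

# Offline check of the extraction step on a small string
sample = '<html><body><a href="page1.html">First</a><a>No link</a></body></html>'
links = extract_links(sample)
print(links)

# With a real page (placeholder URL):
# links = extract_links(fetch_html('https://example.com'))
```

Setting a `timeout` prevents the program from waiting indefinitely when the server does not respond.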