# Introduction to Web Scraping with Python

This notebook introduces the basic tools for web scraping with Python:
- Accessing a webpage
- Extracting source code from a webpage (HTML)
- Parsing and navigating HTML with `BeautifulSoup`

## Accessing the internet with Python

The package `requests` can be used to send requests over the internet. 

When visiting a webpage, you are sending a "get" request to the server where the webpage is hosted. 

In Python, a get request can be sent with `requests.get(url)`. This returns a `Response` object with attributes such as the status code, headers and content.

In the code below, we send a request to the news overview for the EU's Climate Action section.
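A minimal sketch of that request could look like this. The URL is the one used later in this notebook; the `timeout` value and the `try`/`except` fallback are assumptions, added so that the cell neither hangs nor crashes without an internet connection.

```python
import requests

url = "https://ec.europa.eu/clima/news_en"

response = None
try:
    # Send the get request; give up after 10 seconds (assumed value)
    response = requests.get(url, timeout=10)
except requests.RequestException as error:
    # Raised for connection problems, timeouts, invalid URLs and similar
    print(f"The request could not be sent: {error}")
```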

`response` is now a `Response` object containing various information about that request.

### Checking the request

To check if the request was successful, we can check the status code by inspecting the attribute `.status_code`:
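As a sketch, the status code can be printed like this. The request is repeated so the cell runs on its own; in the notebook you would simply reuse `response` from the earlier cell. The `timeout` value is an assumption.

```python
import requests

response = None
try:
    response = requests.get("https://ec.europa.eu/clima/news_en", timeout=10)
    # The status code is stored as an integer
    print(response.status_code)
except requests.RequestException as error:
    print(f"The request could not be sent: {error}")
```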

Status code 200 means "OK": our request was successful.

This can be verified by checking the attribute `.reason`:
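A short sketch, again repeating the request so the cell is self-contained (the `timeout` value is an assumption):

```python
import requests

response = None
try:
    response = requests.get("https://ec.europa.eu/clima/news_en", timeout=10)
    # .reason holds the short text that accompanies the status code
    print(response.status_code, response.reason)
except requests.RequestException as error:
    print(f"The request could not be sent: {error}")
```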

**Quick note on status codes**

- Status codes beginning with 2 or 3: The request was successful (codes beginning with 3 are redirects, which `requests` follows automatically by default)
- Status codes beginning with 4: The request has failed (client-side, e.g. 404 when specifying a URL that does not exist on a given domain).
- Status codes beginning with 5: The request has failed (server-side)

Status codes can be used in code to check whether or not a site is reached before scraping.
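One way to do this is sketched below. `fetch_html` is a hypothetical helper written for this notebook, not part of `requests`, and treating only status code 200 as "reached" is an assumption you may want to relax.

```python
import requests


def fetch_html(url):
    """Return the page source as bytes, or None if the page was not reached."""
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException as error:
        # Covers connection errors, timeouts and malformed URLs
        print(f"Could not send the request: {error}")
        return None
    if response.status_code != 200:
        print(f"Skipping {url}: {response.status_code} {response.reason}")
        return None
    return response.content
```

Calling `fetch_html("https://ec.europa.eu/clima/news_en")` would then give either the source code or `None`, so the scraping code can check for `None` before continuing.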

### Content of a webpage

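The raw source code from a webpage can be extracted from the attribute `.content`. This returns the source as bytes; the attribute `.text` returns the same source decoded to a string.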
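As an illustration, the sketch below prints the beginning of the source code. The request is repeated so the cell runs on its own, and the cut-off at 300 bytes is an arbitrary choice to keep the output short.

```python
import requests

response = None
try:
    response = requests.get("https://ec.europa.eu/clima/news_en", timeout=10)
except requests.RequestException as error:
    print(f"The request could not be sent: {error}")

if response is not None:
    # Show only the first 300 bytes of the raw source code
    print(response.content[:300])
```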

With this raw source code, one *could* process it as is, using something like regular expressions to find the relevant parts of the source code.

However, HTML has a certain structure. This can be utilized to extract specific information from a webpage.

## A quick introduction to HTML

Instead of processing the HTML as raw text, we can utilize the structure of HTML to extract specific parts of a webpage.

This requires some knowledge of what HTML is and how it is structured.

HTML is short for "Hyper-Text Markup Language". It is used on webpages to give the pages their structure.

HTML is structured in "tags" denoted by `<>` and `</>`. The tags denote what kind of content it is. `<p>` is for example a paragraph tag. A piece of HTML like: `<p> This is a paragraph </p>` will render the sentence "This is a paragraph" as a paragraph. Common tags include `h1` for headings (and `h2`, `h3` and so on), `a` for links and `div` for a "division" or "section".

HTML is structured in a tree-like structure. Tags are therefore usually located within other tags. Tags on the same level are referred to as "siblings", tags inside other tags are referred to as "children" and tags that contain other tags are referred to as "parents".

HTML uses "attributes" both to differentiate between tags of the same type and to add other variables/information to a tag. The `id` attribute is, for example, used to give a single tag an identifier that is unique within the page. `class` is used to group tags that belong together, typically so that they share the same styling; several tags can have the same class. A common and useful attribute is `href`, which contains the URL that a hyperlink refers to.

```
    <html>
        <body>
            <div id="convo1">
                <p class="kenobi">Hello There!</p>
            </div>
            <div id="convo2">
                <p class="grievous">General Kenobi!</p>
            </div>
            <div id="convo3">
                <p class="kenobi">So Uncivilized!</p>
            </div>
        </body>
    </html>
```    


The code above is an example of HTML code. Rendered as a webpage it would only contain the text within the tags:

```
Hello There!

General Kenobi!

So Uncivilized!
```

The structure and the tags of the HTML allow us to extract only specific parts of the code. This is because the structure and the tags make certain parts of the code uniquely identifiable. For example:

- The text "Hello There!" is located within a p tag with the class "kenobi". 
- The p tag containing the text "Hello There!" is located within the div tag with id "convo1" (tags located inside other tags are referred to as "children")
- The div tag with id "convo1" is located next to another div tag with id "convo2" (tags located next to each other or on the same level are referred to as "siblings")

Combining the information, we can uniquely refer to the tag containing "Hello There!" by specifying that we want a p tag with class "kenobi" that is a child of a div tag with id "convo1".

## Parsing HTML with BeautifulSoup

The package "BeautifulSoup" (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is developed specifically to navigate and parse HTML (and XML) code. It works by converting HTML code to a "soup object" from which specific parts of the HTML can be extracted by referring to specific tags or paths.

The code below converts the HTML from before to a soup object:

```python
from bs4 import BeautifulSoup as bs

html = '<html><body><div id="convo1"><p class="kenobi">Hello There!</p></div><div id="convo2"><p class="grievous">General Kenobi!</p></div><div id="convo3"><p class="kenobi">So Uncivilized!</p></div></body></html>'
soup = bs(html, "html.parser") # The second argument specifies the parser to use; how the code should be interpreted
print(soup.prettify()) # Prints the HTML
```

```
<html>
 <body>
  <div id="convo1">
   <p class="kenobi">
    Hello There!
   </p>
  </div>
  <div id="convo2">
   <p class="grievous">
    General Kenobi!
   </p>
  </div>
  <div id="convo3">
   <p class="kenobi">
    So Uncivilized!
   </p>
  </div>
 </body>
</html>
```


When printed with `.prettify()` it looks like the same text but we are now able to navigate it using the tags.

### Finding tags

The methods `.find()` and `.find_all()` are used to find the first match and all matches respectively. The first argument of the method is the tag. Other arguments can then be added to make the search more specific.

The method `.get_text()` extracts the actual textual content within the tag (between `<p>` and `</p>` in this case):
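A small sketch using the Kenobi example; the HTML is repeated here so the cell runs on its own.

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

first_p = soup.find("p")    # first p tag in the document
all_p = soup.find_all("p")  # list of all p tags

print(first_p)              # <p class="kenobi">Hello There!</p>
print(len(all_p))           # 3
print(first_p.get_text())   # Hello There!
```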

### Using attributes to find tags

Search for id attribute:
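For instance, the following sketch finds the div tag with id "convo1" in the Kenobi example (HTML repeated so the cell is self-contained):

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# Find the div tag whose id attribute is "convo1"
convo1 = soup.find("div", id="convo1")
print(convo1.get_text())  # Hello There!
```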

Notice that `.get_text()` extracts *all* text within the tag including text within child tags.

Search for class attribute (notice the `_` added to `class_`, because `class` is a reserved keyword in Python):
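A sketch of a class search on the Kenobi example (HTML repeated so the cell is self-contained):

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# All p tags with the class "kenobi"
kenobi_tags = soup.find_all("p", class_="kenobi")
kenobi_lines = [tag.get_text() for tag in kenobi_tags]
print(kenobi_lines)  # ['Hello There!', 'So Uncivilized!']
```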

Tags can also be found by searching for the attribute alone:
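The sketch below shows two readings of this on the Kenobi example: leaving out the tag name, and passing `True` as the value to match any tag that has the attribute at all.

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# No tag name given: any tag with id "convo2" matches
convo2 = soup.find(id="convo2")
print(convo2.name)  # div

# True matches every tag that has an id attribute, whatever its value
tags_with_id = soup.find_all(id=True)
print(len(tags_with_id))  # 3
```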

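BeautifulSoup accepts many attributes directly as keyword arguments (`id`, `href`, `class_`). However, HTML places few restrictions on what attributes can be called, and some names (for example names containing a hyphen) cannot be written as keyword arguments in Python. BeautifulSoup therefore supports searching for any attribute with the following syntax: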

`attrs = {"attribute": "value"}`
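A sketch of the `attrs` syntax follows. The first search uses the Kenobi example; the second uses a made-up tag with a `data-speaker` attribute, which is purely hypothetical and only there to show an attribute name that cannot be passed as a keyword argument.

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# Same result as soup.find("p", class_="grievous")
grievous = soup.find("p", attrs={"class": "grievous"})
print(grievous.get_text())  # General Kenobi!

# Hypothetical HTML with an attribute name containing a hyphen
other_soup = bs('<p data-speaker="kenobi">Hello There!</p>', "html.parser")
speaker_tag = other_soup.find("p", attrs={"data-speaker": "kenobi"})
print(speaker_tag.get_text())  # Hello There!
```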

#### Knowledge check:

What tags or attributes can be used to extract the text "General Kenobi"?

```python
print(soup.prettify())
```

```
<html>
 <body>
  <div id="convo1">
   <p class="kenobi">
    Hello There!
   </p>
  </div>
  <div id="convo2">
   <p class="grievous">
    General Kenobi!
   </p>
  </div>
  <div id="convo3">
   <p class="kenobi">
    So Uncivilized!
   </p>
  </div>
 </body>
</html>
```

### Expanding search using regex

Attribute values can be long and sometimes adhere to a structure, where we want to find all attributes starting with some value. 

Instead of passing an exact string match as an argument for `.find()`, one can instead pass a compiled regular expression pattern to search for.

We will not fully explain regular expressions here, but put shortly, regular expressions are a syntax for writing patterns that can match text strings. Instead of searching specifically for "kenobi", one could search for a pattern like starts with "ken" (`"^ken"`), ends with "obi" (`".*obi$"`) or contains six consecutive letters, digits or underscores (`"\w{6}"`).

Regular expressions can be compiled using `re.compile(pattern)`. This pattern can then be used in `.find()` and `.find_all()`.

The code below searches for tags with an attribute value starting with "gri":
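A sketch on the Kenobi example, searching the class attribute with the pattern `"^gri"` and, for comparison, the id attribute with `"^convo"`:

```python
import re

from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# Tags with a class starting with "gri"
gri_tags = soup.find_all(class_=re.compile("^gri"))
print([tag.get_text() for tag in gri_tags])  # ['General Kenobi!']

# Tags with an id starting with "convo"
convo_tags = soup.find_all(id=re.compile("^convo"))
print(len(convo_tags))  # 3
```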

### Search for specific text

The `.find()` and `.find_all()` methods have a `string` argument to search for specific strings. Regular expressions can be used here as well.
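A sketch of both variants on the Kenobi example. A tag name is given in both searches so that tags are returned; without a tag name, the matching strings themselves are returned.

```python
import re

from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# Exact match on the text of a p tag
hello_tag = soup.find("p", string="Hello There!")
print(hello_tag["class"])  # ['kenobi']

# Regular expression: p tags whose text contains "Kenobi"
kenobi_mentions = soup.find_all("p", string=re.compile("Kenobi"))
print([tag.get_text() for tag in kenobi_mentions])  # ['General Kenobi!']
```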

### Navigating the HTML structure

Using `.find()` returns a new soup object (technically a `Tag`, which supports the same search methods), and `.find_all()` returns a list of them. Because these methods search *within* the object they are called on, only tags nested inside that object are returned.

This allows one to parse further by first specifying one tag and then another:
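A sketch on the Kenobi example: first the div tag is found, then the p tag inside it.

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

# Step 1: find the div, step 2: search only inside that div
convo2 = soup.find("div", id="convo2")
line = convo2.find("p")
print(line.get_text())  # General Kenobi!

# The same search written as one chained expression
same_line = soup.find("div", id="convo2").find("p").get_text()
```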

It also allows one to navigate the structure, as the extracted soup objects maintain references to the HTML structure that they were extracted from.

Using `.parent`, one can locate the tag in which a certain tag is located:
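A sketch on the Kenobi example, going from the "grievous" p tag to the div tag that contains it:

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

line = soup.find("p", class_="grievous")
parent = line.parent  # the tag directly enclosing the p tag
print(parent.name)    # div
print(parent["id"])   # convo2
```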

You can also iterate over all parents (and grandparents, so to speak) with `.parents`:
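A sketch on the Kenobi example that collects the tag names of everything enclosing the "grievous" p tag, from the innermost level outwards:

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

line = soup.find("p", class_="grievous")

# .parents is a generator, so it is consumed with a loop or comprehension
parent_names = [parent.name for parent in line.parents]
print(parent_names)  # starts with 'div', 'body', 'html'
```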

Using `.next_sibling` and `.previous_sibling` you can navigate between tags on the same level:
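A sketch on the Kenobi example. Note that this works directly here because the HTML string has no whitespace between the tags; in indented HTML the nearest sibling is often a string of whitespace, in which case `.find_next_sibling()` and `.find_previous_sibling()` skip to the nearest tag instead.

```python
from bs4 import BeautifulSoup as bs

html = (
    '<html><body>'
    '<div id="convo1"><p class="kenobi">Hello There!</p></div>'
    '<div id="convo2"><p class="grievous">General Kenobi!</p></div>'
    '<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>'
    '</body></html>'
)
soup = bs(html, "html.parser")

convo2 = soup.find("div", id="convo2")
print(convo2.previous_sibling["id"])  # convo1
print(convo2.next_sibling["id"])      # convo3

# Skips over any text between the tags and returns the next div tag
next_div = convo2.find_next_sibling("div")
print(next_div.get_text())            # So Uncivilized!
```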

## Finding the right tags

Let us try applying some of these skills on the European Union Climate Action news section.

We already know how to get the HTML, so this just has to be converted to a soup object, and we are ready to go:
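A sketch of these two steps put together. The name `eu_soup`, the `timeout` value and the `try`/`except` fallback are assumptions; the request is repeated so the cell runs on its own.

```python
import requests
from bs4 import BeautifulSoup as bs

url = "https://ec.europa.eu/clima/news_en"

eu_soup = None
try:
    response = requests.get(url, timeout=10)
    # Convert the raw source code to a soup object
    eu_soup = bs(response.content, "html.parser")
except requests.RequestException as error:
    print(f"The request could not be sent: {error}")
```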

Finding the right tags by just browsing through raw HTML is not ideal.

Instead we can use our browser to help us find the parts of the webpage to extract. Almost all browsers have an "inspector tool" of some kind that allows one to inspect the source code of a webpage (shortcut `F12` for a lot of browsers).

## Extracting news headlines from EU Climate Action News

Inspecting the HTML of https://ec.europa.eu/clima/news_en, we see that the headlines are part of an "a" tag within a span tag with the class "field-content". This class is however not unique. Going up a level further, there is another span tag with the class "views-field-title", which does seem to be unique for the headlines.

We can extract the first headline as follows:
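The sketch below illustrates the approach on a small stand-in for the page. The HTML only mimics the structure described above; the headlines and URLs in it are made up, and the live page may be structured differently today. With the real page, the same `.find()` calls would be made on the soup object created from the response.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for the news overview (made-up headlines and URLs)
news_html = """
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-one_en">Example headline one</a>
</span></span>
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-two_en">Example headline two</a>
</span></span>
"""
news_soup = bs(news_html, "html.parser")

# First span with the class unique to headlines, then the a tag inside it
first_title_span = news_soup.find("span", class_="views-field-title")
first_headline = first_title_span.find("a").get_text()
print(first_headline)  # Example headline one
```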

The headline is also a link. Links are always created as "a" tags with the linked URL stored as an "href" attribute.

Attributes can be extracted directly from soup objects using `[attribute]`:
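A sketch on a stand-in for one headline; the markup mimics the structure described earlier, and the headline and URL are made up.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for one headline (made-up text and URL)
news_html = (
    '<span class="views-field-title"><span class="field-content">'
    '<a href="/clima/news/example-one_en">Example headline one</a>'
    '</span></span>'
)
news_soup = bs(news_html, "html.parser")

link_tag = news_soup.find("span", class_="views-field-title").find("a")

# Square brackets look up an attribute, like a key in a dictionary
print(link_tag["href"])        # /clima/news/example-one_en

# .get() returns None instead of raising an error if the attribute is missing
print(link_tag.get("target"))  # None
```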

Extracting all the titles has to be done stepwise: `.find_all()` always returns a list, so it cannot be chained the way `.find()` can. Instead, one loops over the list and calls `.find()` on each element.
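A sketch of the stepwise approach on a stand-in for the page (the markup mimics the structure described earlier; headlines and URLs are made up):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for the news overview (made-up headlines and URLs)
news_html = """
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-one_en">Example headline one</a>
</span></span>
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-two_en">Example headline two</a>
</span></span>
"""
news_soup = bs(news_html, "html.parser")

# Step 1: find all spans with the headline class
title_spans = news_soup.find_all("span", class_="views-field-title")

# Step 2: extract the text of the a tag inside each span
titles = []
for span in title_spans:
    titles.append(span.find("a").get_text())

print(titles)  # ['Example headline one', 'Example headline two']
```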

And the links:
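The links can be collected the same way, taking the `href` attribute instead of the text. The stand-in HTML below again uses made-up headlines and URLs.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for the news overview (made-up headlines and URLs)
news_html = """
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-one_en">Example headline one</a>
</span></span>
<span class="views-field-title"><span class="field-content">
  <a href="/clima/news/example-two_en">Example headline two</a>
</span></span>
"""
news_soup = bs(news_html, "html.parser")

title_spans = news_soup.find_all("span", class_="views-field-title")

# One href per headline, in the same order as the headlines
links = [span.find("a")["href"] for span in title_spans]
print(links)  # ['/clima/news/example-one_en', '/clima/news/example-two_en']
```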

## EXERCISE: Extracting information from EU Climate Action News

Using the right tags and attributes for search, extract the following from the EU Climate Action News (https://ec.europa.eu/clima/news_en):

1. The dates of the news articles.

2. The summaries of the news articles.

3. The URLs for the images used for the news articles.

If you are familiar with Python dictionaries and lists, see if you can collect the data in a format that allows you to easily find the summary for a specific article later.

![img](https://ec.europa.eu/clima/sites/clima/files/styles/news-events/public/news/images/20201116.jpg?itok=Fb3Zp__J)