# Introduction to Web Scraping with Python

This notebook introduces the basic tools for web scraping with Python:
- Accessing a webpage
- Extracting source code from a webpage (HTML)
- Parsing and navigating HTML with `BeautifulSoup`

## Accessing the internet with Python

The package `requests` can be used to send requests over the internet. 

When visiting a webpage, you are sending a "get" request to the server where the webpage is hosted. 

In Python, a GET request can be sent with `requests.get(url)`. This returns a `Response` object with attributes such as the status code, headers and content.

In the code below, we send a request to the news overview for the EU's Climate Action section (https://ec.europa.eu/clima/news_en).

In [1]:
import requests # Importing the package

response = requests.get("https://ec.europa.eu/clima/news_en")

`response` is now a `Response` object containing information about the server's reply to that request.

### Checking the request

To check if the request was successful, we can check the status code by inspecting the attribute `.status_code`:

In [2]:
response.status_code

200

Status code 200 means "OK": our request was successful.

This can be verified by checking the attribute `.reason`:

In [3]:
response.reason

'OK'

In [4]:
print(response.status_code, response.reason)

200 OK


**Quick note on status codes**

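- Status codes beginning with 2: The request was successful.
- Status codes beginning with 3: The request was redirected (`requests` follows redirects automatically by default, so you will usually see the status of the final page).
- Status codes beginning with 4: The request has failed (client-side, e.g. 404 when specifying a URL that does not exist on a given domain).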
- Status codes beginning with 5: The request has failed (server-side)

Status codes can be used in code to check whether or not a site is reached before scraping.
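
As a minimal sketch of such a check, the helper below sorts status codes into the categories listed above. The function name `status_category` is hypothetical and only serves as illustration; the example uses Python's built-in `http` module so that it runs without an internet connection.

```python
from http import HTTPStatus

def status_category(code):
    """Map an HTTP status code to a rough category (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

# With requests, the code would come from response.status_code.
for code in (200, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_category(code))
```

When working with `requests` directly, `response.ok` (True for status codes below 400) and `response.raise_for_status()` (raises an error for 4xx and 5xx codes) offer similar checks.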

### Content of a webpage

The raw source code of a webpage is available (as bytes) in the attribute `.content`.

In [5]:
content = response.content
print(content[0:1000]) # Printing the first 1000 bytes

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://ec.europa.eu/clima/sites/clima/themes/clima_theme/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="description" content="Climate Action -" />\n<meta name="keywords" content="European Commission, European Union, EU" />\n<meta name="robots" content="follow, index" />\n<meta name="generator" content="Drupal 7 (https://drupal.org)" />\n<link rel="canonical" href="https://ec.europa.eu/clima/news_en" />\n<link rel="shortlink" href="https://ec.europa.eu/clima/news_en" />\n<meta http-equiv="content-language" content="en" />\n<meta name="revisit-after" content="15 days" />\n<meta property="og:site_name" content="Climate Action - European Commission" />\n<meta property="og:type" content="website" />\n<meta property="og:url" content="https://ec.europa.eu/clima/news_en" />\n<meta property="og:title" content="News" />\n<meta proper
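
The `b'...'` prefix in the output shows that `.content` holds bytes rather than a regular string. The sketch below shows the difference using a short, made-up byte string as a stand-in for `response.content` (an assumption, so the example runs offline).

```python
# A short byte string standing in for response.content (assumed sample).
raw = b'<title>Climate Action \xc2\xa9 EU</title>'

print(type(raw))            # bytes, as returned by .content
text = raw.decode("utf-8")  # decode to str using the charset declared by the page
print(type(text))
print(text)
```

`requests` also provides the attribute `.text`, which returns the content already decoded to a string.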

With this raw source code, one *could* process it as is, using something like regular expressions to find the relevant parts.

However, HTML has a certain structure. This can be utilized to extract specific information from a webpage.

## A quick introduction to HTML

Instead of processing the HTML as raw text, we can utilize the structure of HTML to extract specific parts of a webpage.

This requires some knowledge of what HTML is and how it is structured.

HTML is short for "Hyper-Text Markup Language". It is used on webpages to give the pages their structure.

HTML is structured in "tags" denoted by `<>` and `</>`. The tags denote what kind of content it is. `<p>` is for example a paragraph tag. A piece of HTML like: `<p> This is a paragraph </p>` will render the sentence "This is a paragraph" as a paragraph. Common tags include `h1` for headings (and `h2`, `h3` and so on), `a` for links and `div` for a "division" or "section".

HTML has a tree-like structure, so tags are usually located within other tags. Tags on the same level are referred to as "siblings", tags inside other tags as "children", and tags that enclose other tags as "parents".

HTML uses "attributes" both to differentiate between tags of the same type and to add other information to a tag. The `id` attribute, for example, gives a single tag an identifier that should be unique on the page. The `class` attribute groups tags that share a role or styling, so several tags can have the same class. A common and useful attribute is `href`, which contains the URL that a hyperlink refers to.

```
    <html>
        <body>
            <div id="convo1">
                <p class="kenobi">Hello There!</p>
            </div>
            <div id="convo2">
                <p class="grievous">General Kenobi!</p>
            </div>
            <div id="convo3">
                <p class="kenobi">So Uncivilized!</p>
            </div>
        </body>
    </html>
```    


The code above is an example of HTML code. Rendered as a webpage it would only contain the text within the tags:

```
Hello There!

General Kenobi!

So Uncivilized!
```

The structure and the tags of the HTML allow us to extract only specific parts of the code, because they make certain parts of the code uniquely identifiable. For example:

- The text "Hello There!" is located within a p tag with the class "kenobi". 
- The p tag containing the text "Hello There!" is located within the div tag with id "convo1" (tags located inside other tags are referred to as "children")
- The div tag with id "convo1" is located next to another div tag with id "convo2" (tags located next to each other or on the same level are referred to as "siblings")

Combining the information, we can uniquely refer to the tag containing "Hello There!" by specifying that we want a p tag with class "kenobi" that is a child of a div tag with id "convo1".
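
Looking ahead to the BeautifulSoup package introduced in the next section, the description above could be translated to code roughly as follows. This is a sketch: the HTML is a shortened copy of the example, and the CSS selector variant with `.select_one()` is an alternative not otherwise covered in this notebook.

```python
from bs4 import BeautifulSoup

html = """
<div id="convo1"><p class="kenobi">Hello There!</p></div>
<div id="convo2"><p class="grievous">General Kenobi!</p></div>
<div id="convo3"><p class="kenobi">So Uncivilized!</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step by step: first the div with id "convo1", then the p with class "kenobi" inside it.
tag = soup.find("div", id="convo1").find("p", class_="kenobi")
print(tag.get_text())

# The same path as a CSS selector: a p.kenobi that is a child of div#convo1.
same_tag = soup.select_one("div#convo1 > p.kenobi")
print(same_tag.get_text())
```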

## Parsing HTML with BeautifulSoup

The package "BeautifulSoup" (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is developed specifically to navigate and parse HTML (and XML) code. It works by converting HTML code to a "soup object", from which specific parts of the HTML can be extracted by referring to specific tags or paths.

The code below converts the HTML from before to a soup object using `bs`, the alias under which `BeautifulSoup` is imported from `bs4`:

In [6]:
from bs4 import BeautifulSoup as bs

html = '<html><body><div id="convo1"><p class="kenobi">Hello There!</p></div><div id="convo2"><p class="grievous">General Kenobi!</p></div><div id="convo3"><p class="kenobi">So Uncivilized!</p></div></body></html>'
soup = bs(html, "html.parser") # The second argument specifies the parser to use, i.e. how the code should be interpreted
print(soup.prettify()) # Prints the HTML

<html>
 <body>
  <div id="convo1">
   <p class="kenobi">
    Hello There!
   </p>
  </div>
  <div id="convo2">
   <p class="grievous">
    General Kenobi!
   </p>
  </div>
  <div id="convo3">
   <p class="kenobi">
    So Uncivilized!
   </p>
  </div>
 </body>
</html>


When printed with `.prettify()` it looks like the same text but we are now able to navigate it using the tags.

### Finding tags

The methods `.find()` and `.find_all()` are used to find the first match and all matches respectively. The first argument of the method is the tag. Other arguments can then be added to make the search more specific.

Note that `.find()` and `.find_all()` are methods tied to a soup object, so they have to be used with some object returned from `bs` (in this case the object `soup` created earlier).

In [7]:
soup.find("p") # Finds the first p tag

<p class="kenobi">Hello There!</p>

`.find()` returns a new soup object (technically a `Tag`) with the HTML in the first matched tag.

In [8]:
soup.find_all("p") # Finds all p tags (returned as a list)

[<p class="kenobi">Hello There!</p>,
 <p class="grievous">General Kenobi!</p>,
 <p class="kenobi">So Uncivilized!</p>]

`.find_all()` returns a list of soup objects with the HTML in the matched tags.
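
It is worth knowing what happens when nothing matches: `.find()` returns `None` and `.find_all()` returns an empty list. The sketch below uses a small stand-in HTML string and shows one possible way to guard against a missing tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p class="kenobi">Hello There!</p></div>', "html.parser")

missing = soup.find("h1")     # There is no h1 tag in this HTML
print(missing)                # None
print(soup.find_all("h1"))    # An empty list

# Guard before calling methods on the result to avoid an AttributeError.
heading = missing.get_text() if missing is not None else ""
print(repr(heading))
```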

The method `.get_text()` extracts the actual textual content within the tag (between `<p>` and `</p>` in this case):

In [9]:
soup.find("p").get_text()

'Hello There!'

`.get_text()` works on a single soup object, not on the list returned by `.find_all()`. To extract the text from each element of that list, we have to iterate over it (e.g. with a for loop):

In [10]:
for tag in soup.find_all("p"):
    print(tag.get_text())

Hello There!
General Kenobi!
So Uncivilized!


### Using attributes to find tags

In addition to searching for tags, we can also specify attributes. Some attributes, like `id` and `class`, have their own arguments.

In [11]:
soup.find("div", id = "convo1").get_text() # Search for a specific id attribute

'Hello There!'

Notice that `.get_text()` extracts *all* text within the tag including text within child tags.
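
When a tag has several children, their texts are joined without any space in between. A small sketch (with an assumed two-paragraph div) of how the `separator` and `strip` arguments of `.get_text()` can help:

```python
from bs4 import BeautifulSoup

html = '<div id="convo"><p>Hello There!</p><p>General Kenobi!</p></div>'
div = BeautifulSoup(html, "html.parser").find("div", id="convo")

plain = div.get_text()                            # Texts of both p tags run together
spaced = div.get_text(separator=" ", strip=True)  # Pieces stripped and joined by a space

print(plain)
print(spaced)
```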

Search for class attribute (notice the `_` added to `class_`, as `class` is a reserved keyword in Python):

In [12]:
soup.find("p", class_ = "kenobi").get_text() # Search for a specific class attribute

'Hello There!'

Tags can also be found by searching for the attribute alone:

In [13]:
soup.find(class_ = "kenobi").get_text()

'Hello There!'

Common attributes such as `id`, `href` and `class` can be passed as keyword arguments. HTML, however, places few restrictions on what attributes can be called, and some names (e.g. names containing a hyphen) are not valid Python argument names. BeautifulSoup therefore supports searching for any attribute with the following syntax:

`attrs = {"attribute": "value"}`

In [14]:
soup.find(attrs = {"class": "kenobi"}).get_text()

'Hello There!'
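
The `attrs` syntax is mostly needed for attribute names that cannot be written as Python keyword arguments. The sketch below assumes a made-up `data-speaker` attribute for illustration:

```python
from bs4 import BeautifulSoup

# Assumed sample HTML with a custom "data-speaker" attribute.
html = '<p data-speaker="kenobi">Hello There!</p><p data-speaker="grievous">General Kenobi!</p>'
soup = BeautifulSoup(html, "html.parser")

# Writing data-speaker="grievous" as a keyword argument would be a syntax error,
# so the attribute is passed in the attrs dictionary instead.
tag = soup.find("p", attrs={"data-speaker": "grievous"})
print(tag.get_text())
```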

#### Knowledge check:

What tags or attributes can be used to extract the text "General Kenobi"?

In [15]:
print(soup.prettify())

<html>
 <body>
  <div id="convo1">
   <p class="kenobi">
    Hello There!
   </p>
  </div>
  <div id="convo2">
   <p class="grievous">
    General Kenobi!
   </p>
  </div>
  <div id="convo3">
   <p class="kenobi">
    So Uncivilized!
   </p>
  </div>
 </body>
</html>


### Expanding search using regex

Attribute values can be long and sometimes follow a pattern, so that we want to find all tags whose attribute value starts with some text.

Instead of passing an exact string as an argument to `.find()`, one can pass a compiled regular expression pattern to search for.

We will not explain regular expressions fully here. In short, regular expressions are a syntax for writing patterns that match text strings. Instead of searching specifically for "kenobi", one could search for a pattern like starts with "ken" (`"^ken"`), ends with "obi" (`".*obi$"`) or contains six consecutive word characters (`"\w{6}"`).
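
To get a feel for these patterns, they can be tried directly on the two class names from the example using Python's `re` module. This is only a sketch; the last pattern (`^\w{6}$`, exactly six word characters) is an addition for comparison.

```python
import re

classes = ["kenobi", "grievous"]
patterns = [r"^ken", r".*obi$", r"\w{6}", r"^\w{6}$"]

# For each pattern, collect the class names in which the pattern is found.
results = {}
for pattern in patterns:
    results[pattern] = [c for c in classes if re.search(pattern, c)]
    print(pattern, "->", results[pattern])
```

Note that `\w{6}` also matches "grievous", since that name contains six word characters in a row; anchoring the pattern with `^` and `$` restricts it to names of exactly six characters.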

Regular expressions can be compiled using `re.compile(pattern)`. This pattern can then be used in `.find()` and `.find_all()`.

In [16]:
import re

soup.find(class_=re.compile("^gri")) # Search for tags with a class attribute starting with "gri"

<p class="grievous">General Kenobi!</p>

### Search for specific text

The `.find()` and `.find_all()` methods have a `string = ` argument to search for specific strings in the text of the HTML. Regular expressions can be used here as well.

In [17]:
soup.find(string = re.compile("Hello"))

'Hello There!'
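
Searching with `string` returns the matched text itself rather than a tag. If the surrounding tag is needed, `.parent` leads back to it, as in this small sketch (using a shortened copy of the example HTML):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="convo1"><p class="kenobi">Hello There!</p></div>', "html.parser")

match = soup.find(string=re.compile("Hello"))
print(type(match).__name__)   # The matched text is a string-like object, not a tag
print(match.parent)           # The tag that contains the text
print(match.parent["class"])  # Attributes of that tag are available as usual
```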

### Navigating the HTML structure

Using `.find()` returns a new soup object (`.find_all()` a list of soup objects). Because these methods search for tags *within* the soup object, the results are always tags nested inside the object being searched (its children, grandchildren and so on).

This allows one to narrow the search further by first specifying one tag and then another:

In [18]:
soup_child = soup.find("div")

soup_grandchild = soup_child.find("p")

print(soup_grandchild)

<p class="kenobi">Hello There!</p>


It also allows one to navigate the structure, as the extracted soup objects maintain references to the HTML structure they were extracted from.

Using `.parent`, one can locate the tag in which a certain tag is located:

In [19]:
soup_child = soup.find("p", class_ = "kenobi")

print(soup_child)

print(soup_child.parent) # Returns the parent of soup_child (a div tag in this case)

<p class="kenobi">Hello There!</p>
<div id="convo1"><p class="kenobi">Hello There!</p></div>


You can also iterate over all parents (and grandparents, so to speak) with `.parents`:

In [20]:
for parent in soup_child.parents:
    print(parent.name)

div
body
html
[document]


Using `.next_sibling` and `.previous_sibling` you can navigate between tags on the same level:

In [21]:
soup_child = soup.find("div")

print(soup_child)

print(soup_child.next_sibling) # Returns the next tag on the same level as soup_child

<div id="convo1"><p class="kenobi">Hello There!</p></div>
<div id="convo2"><p class="grievous">General Kenobi!</p></div>
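
One caveat: real webpages usually have line breaks and indentation between tags, and that whitespace counts as a sibling too. The sketch below (same structure as the example, but with line breaks added) shows the difference between `.next_sibling` and the method `.find_next_sibling()`, which only considers tags.

```python
from bs4 import BeautifulSoup

# Same structure as before, but with line breaks between the div tags.
html = """<body>
<div id="convo1"><p>Hello There!</p></div>
<div id="convo2"><p>General Kenobi!</p></div>
</body>"""
soup = BeautifulSoup(html, "html.parser")
first_div = soup.find("div")

print(repr(first_div.next_sibling))        # The line break between the two tags
print(first_div.find_next_sibling("div"))  # Skips text and returns the next div tag
```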


## Finding the right tags

Let us try applying some of these skills on the European Union Climate Action news section.

We already know how to get the HTML, so this just has to be converted to a soup object, and we are ready to go:

In [22]:
response = requests.get("https://ec.europa.eu/clima/news_en")

eu_html = response.content

eu_soup = bs(eu_html, "html.parser")

Finding the right tags by just browsing through raw HTML is not ideal.

Instead we can use our browser to help us find the parts of the webpage to extract. Almost all browsers have an "inspector tool" of some kind that allows one to inspect the source code of a webpage (shortcut `F12` for a lot of browsers on Windows and `Command-Option-I` for Safari on Mac).

## Extracting news headlines from EU Climate Action News

Inspecting the HTML of https://ec.europa.eu/clima/news_en, we see that the headlines are part of an "a" tag within a span tag with the class "views-field". This class is however not unique. Going up a level further, there is another span tag with the class "views-field-title", which does seem to be unique for the headlines.

We can extract the first headline as follows:

In [23]:
news_title_soup = eu_soup.find("span", class_ = "views-field-title").find("a") # The find methods are chained; first span, then a

print(news_title_soup.get_text())

Further action required to meet 2020 fuel quality targets despite 3.7% drop in greenhouse gas intensity since 2010


Note that the span tag in question actually contains two classes: "views-field" and "views-field-title". HTML class names cannot contain spaces, so when an HTML tag contains a class attribute that contains spaces, it is actually two classes. When specifying the class with `.find()` or `.find_all()`, we only have to specify one of them.
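
This behaviour can be checked on a small stand-in for the markup described above (the link and headline in the sample are made up for illustration):

```python
from bs4 import BeautifulSoup

# Assumed sample mirroring the structure described above.
html = '<span class="views-field views-field-title"><a href="/clima/news/example_en">Example headline</a></span>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span", class_="views-field-title")
print(span["class"])   # The class attribute is returned as a list of class names

# Searching for the other class finds the same tag.
other = soup.find("span", class_="views-field")
print(other == span)
```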

The headline is also a link. Links are almost always created as an "a" tag.

`news_title_soup` is currently the soup object with the "a" tag containing the headline. Supposing we want to collect the links to the articles to scrape the articles themselves, we can extract that directly from this soup object.

The URL linked is almost always stored as an "href" attribute in an "a" tag.

Attributes can be extracted directly from soup objects using `[attribute]`:

In [24]:
news_title_soup['href']

'/clima/news/further-action-required-meet-2020-fuel-quality-targets-despite-37-drop-greenhouse-gas-intensity_en'

Links can be either "absolute" or "relative". An absolute link contains the entire URL needed to access the page. A relative link contains only the path on the current domain.

To convert the output above to a working URL, we have to add the main domain, which can be done with string concatenation:

In [25]:
print("https://ec.europa.eu" + news_title_soup['href'])

https://ec.europa.eu/clima/news/further-action-required-meet-2020-fuel-quality-targets-despite-37-drop-greenhouse-gas-intensity_en
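
Simple concatenation only works for relative links. As an alternative sketch, `urljoin` from Python's standard library handles both cases: relative links are resolved against the page URL, while absolute links are left unchanged. The two links below are taken from the news page.

```python
from urllib.parse import urljoin

base_url = "https://ec.europa.eu/clima/news_en"

links = [
    "/clima/news/commission-sets-forest-reference-levels-delegated-act_en",  # Relative
    "https://ec.europa.eu/commission/presscorner/detail/en/ip_20_2052",      # Absolute
]

full_links = [urljoin(base_url, link) for link in links]
for link in full_links:
    print(link)
```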


### Extracting all headlines

Extracting all the titles has to be done stepwise: `.find_all()` cannot be chained in the same way as `.find()`, because it always returns a list.

In [26]:
span_soup = eu_soup.find_all("span", class_ = "views-field-title")
news_titles_soup = [span.find("a") for span in span_soup] # Finding the a tag in each span
news_titles_soup = list(filter(None, news_titles_soup)) # Removing None entries (spans without an a tag)

for title in news_titles_soup:
    print(title.get_text())

Further action required to meet 2020 fuel quality targets despite 3.7% drop in greenhouse gas intensity since 2010
Carbon Market Report: Emissions from EU ETS stationary installations fall by over 9%
Start of phase 4 of the EU ETS in 2021: adoption of the cap and start of the auctions
LIFE programme: over EUR 280 million in EU funding for environment, nature and climate action projects
Commission launches four public consultations in an important step towards climate neutrality
First Innovation Fund call for large-scale projects: 311 applications for the EUR 1 billion EU funding for clean tech projects
Commission to launch four public consultations in an important step towards climate neutrality
Commission sets Forest Reference Levels in a delegated act
EU recognises best nature, environment and climate action projects
Opening Remarks by Executive Vice-President Frans Timmermans at the European Parliament Plenary Session on the European Climate Law
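
As an alternative sketch, BeautifulSoup's `.select()` method accepts a CSS selector and can express the two steps in one go. The HTML below is an assumed, simplified stand-in for the news page with made-up links and headlines.

```python
from bs4 import BeautifulSoup

# Assumed, simplified stand-in for the news page markup.
html = """
<span class="views-field views-field-title"><a href="/news/a_en">Headline A</a></span>
<span class="views-field views-field-title"></span>
<span class="views-field views-field-title"><a href="/news/b_en">Headline B</a></span>
"""
soup = BeautifulSoup(html, "html.parser")

# All a tags located inside a span with the class "views-field-title".
links = soup.select("span.views-field-title a")
titles = [a.get_text() for a in links]
print(titles)
```

Spans without an "a" tag simply contribute nothing to the result, so no extra filtering step is needed with this approach.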


We can also store the titles as a list:

In [27]:
title_list = [title_soup.get_text() for title_soup in news_titles_soup]
print(title_list)

['Further action required to meet 2020 fuel quality targets despite 3.7% drop in greenhouse gas intensity since 2010', 'Carbon Market Report: Emissions from EU ETS stationary installations fall by over 9%', 'Start of phase 4 of the EU ETS in 2021: adoption of the cap and start of the auctions', 'LIFE programme: over EUR 280 million in EU funding for environment, nature and climate action projects', 'Commission launches four public consultations in an important step towards climate neutrality', 'First Innovation Fund call for large-scale projects: 311 applications for the EUR 1 billion EU funding for clean tech projects', 'Commission to launch four public consultations in an important step towards climate neutrality', 'Commission sets Forest Reference Levels in a delegated act', 'EU recognises best nature, environment and climate action projects', 'Opening Remarks by Executive Vice-President Frans Timmermans at the European Parliament Plenary Session on the European Climate Law']


And the links:

In [28]:
for title in news_titles_soup:
    print(title['href'])

/clima/news/further-action-required-meet-2020-fuel-quality-targets-despite-37-drop-greenhouse-gas-intensity_en
/clima/news/carbon-market-report-emissions-eu-ets-stationary-installations-fall-over-9_en
/clima/news/start-phase-4-eu-ets-2021-adoption-cap-and-start-auctions_en
https://ec.europa.eu/commission/presscorner/detail/en/ip_20_2052
/clima/news/commission-launches-four-public-consultations-important-step-towards-climate-neutrality_en
/clima/news/first-innovation-fund-call-large-scale-projects-311-applications-eur-1-billion-eu-funding-clean_en
/clima/news/commission-launch-four-public-consultations-important-step-towards-climate-neutrality_en
/clima/news/commission-sets-forest-reference-levels-delegated-act_en
https://ec.europa.eu/environment/news/eu-recognises-best-nature-environment-and-climate-action-projects-2020-10-21_en
https://ec.europa.eu/commission/commissioners/2019-2024/timmermans/announcements/opening-remarks-executive-vice-president-frans-timmermans-european-parliament-

Stored as a list of links:

In [29]:
link_list = [title_soup['href'] for title_soup in news_titles_soup]
print(link_list)

['/clima/news/further-action-required-meet-2020-fuel-quality-targets-despite-37-drop-greenhouse-gas-intensity_en', '/clima/news/carbon-market-report-emissions-eu-ets-stationary-installations-fall-over-9_en', '/clima/news/start-phase-4-eu-ets-2021-adoption-cap-and-start-auctions_en', 'https://ec.europa.eu/commission/presscorner/detail/en/ip_20_2052', '/clima/news/commission-launches-four-public-consultations-important-step-towards-climate-neutrality_en', '/clima/news/first-innovation-fund-call-large-scale-projects-311-applications-eur-1-billion-eu-funding-clean_en', '/clima/news/commission-launch-four-public-consultations-important-step-towards-climate-neutrality_en', '/clima/news/commission-sets-forest-reference-levels-delegated-act_en', 'https://ec.europa.eu/environment/news/eu-recognises-best-nature-environment-and-climate-action-projects-2020-10-21_en', 'https://ec.europa.eu/commission/commissioners/2019-2024/timmermans/announcements/opening-remarks-executive-vice-president-frans-ti

If we want to save this list of links as a .txt file, we can write the following:

In [30]:
with open("eu_climate_news_links.txt", 'w') as file: # This line creates a text file in "write" mode
    for line in link_list: # Iterating over each line in the list (each link)
        file.write(line + "\n") # Each link is written to the file followed by a newline (\n)

## EXERCISE: Extracting information from EU Climate Action News

Using the right tags and attributes for search, extract the following from the EU Climate Action News (https://ec.europa.eu/clima/news_en):

1. The dates of the news articles.

2. The summaries of the news articles.

3. The urls for the images used for the news articles.

If you are familiar with Python dictionaries and lists, see if you can collect the data in a format that allows you to easily find the summary for a specific article later.

### Extracting summaries

The summaries are inside a div tag with the class "views-field-field-summary". The summaries can therefore be extracted as follows:

In [31]:
summaries_soup = eu_soup.find_all("div", class_="views-field-field-summary")

summaries_text = [summary_soup.get_text() for summary_soup in summaries_soup]

print(summaries_text)

[' The Commission today adopted its 2018 Fuel Quality Report based on the data submitted by EU countries. According to the data provided, the average greenhouse gas intensity of fuels in the 28 reporting Member States had fallen by 3.7% compared to the 2010 baseline. The year-on-year progress achieved compared to 2017 was limited to a 0.3% decrease. Progress varied greatly across Member States, but almost all need to take swift action to meet the 2020 target of 6%.\n ', ' The European Commission has adopted its annual report on the functioning of the European carbon market. The report covers 2019 and certain developments in 2020.\n ', ' The Commission is finalising the preparations for the period 2021-2030 of the EU ETS (phase 4), starting on 1 January 2021.\n ', ' The European Commission has approved an investment package of more than EUR 280 million from the EU budget for over 120 new LIFE programme projects. This EU funding will trigger total investments of nearly EUR 590 million to

### Collecting the data in a structured format

To make it easier to work with the data later on, we can extract the information and structure it in some sensible format.

In the following, the title, link, date, summary and image URL of each news article are stored as a dictionary (`article_dict`). The articles are gathered in a list.

This format closely resembles JSON (a list of objects).

In [32]:
article_rows_soup = eu_soup.find_all("div", class_ = "views-row")

article_list = []

for row in article_rows_soup:
    article_dict = {}
    
    article_title_soup = row.find("span", class_ = "views-field-title").find("a")
    article_title = article_title_soup.get_text()
    article_link = article_title_soup['href']
    
    article_date = row.find("span", class_ = "date-display-single").get_text()
    
    article_summary_soup = row.find("div", class_ = "views-field-field-summary")
    try:
        article_summary = article_summary_soup.get_text(strip = True)
    except AttributeError: # No summary found (.find() returned None)
        article_summary = ""
    
    article_imgurl = row.find("div", class_ = "views-field-field-nems-core-image").find("img")['src']
    
    article_dict['title'] = article_title
    article_dict['link'] = article_link
    article_dict['date'] = article_date
    article_dict['summary'] = article_summary
    article_dict['imgurl'] = article_imgurl
    
    article_list.append(article_dict)
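
The loop above assumes that every row contains a title, a date and an image; if one of them is missing, `.find()` returns `None` and the chained call fails. One possible way to make the extraction more forgiving is a small helper like the one sketched below. The name `text_or_default` is hypothetical, and the sample row is an assumed stand-in without a summary.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, class_name, default=""):
    """Return the stripped text of the first matching tag, or default if none is found."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default

# Assumed sample row that has a date but no summary.
row = BeautifulSoup(
    '<div class="views-row"><span class="date-display-single">19/11/2020</span></div>',
    "html.parser",
)

date = text_or_default(row, "span", "date-display-single")
summary = text_or_default(row, "div", "views-field-field-summary")
print(date)
print(repr(summary))
```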

Data is now stored as a list of dictionaries: Each list element is a dictionary with the keys title, link, date, summary and imgurl.

In [33]:
article_list[0]

{'title': 'Further action required to meet 2020 fuel quality targets despite 3.7% drop in greenhouse gas intensity since 2010',
 'link': '/clima/news/further-action-required-meet-2020-fuel-quality-targets-despite-37-drop-greenhouse-gas-intensity_en',
 'date': '19/11/2020',
 'summary': 'The Commission today adopted its2018 Fuel Quality Reportbased on the data submitted by EU countries. According to the data provided, the average greenhouse gas intensity of fuels in the 28 reporting Member States had fallen by 3.7% compared to the 2010 baseline. The year-on-year progress achieved compared to2017was limited to a 0.3% decrease. Progress varied greatly across Member States, but almost all need to take swift action to meet the 2020 target of 6%.',
 'imgurl': 'https://ec.europa.eu/clima/sites/clima/files/styles/news-events/public/news/images/20180206.jpg?itok=suQn3NdR'}
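
Because the data resembles JSON, it can be converted with Python's built-in `json` module. The sketch below uses two shortened records (title and date only) in the same shape as `article_list`, and also shows one way of indexing the articles by title for easy lookup later.

```python
import json

# Two shortened records in the same shape as article_list (title and date only).
articles = [
    {"title": "Further action required to meet 2020 fuel quality targets despite 3.7% drop in greenhouse gas intensity since 2010",
     "date": "19/11/2020"},
    {"title": "Carbon Market Report: Emissions from EU ETS stationary installations fall by over 9%",
     "date": "18/11/2020"},
]

# Convert to a JSON string and parse it back again.
as_json = json.dumps(articles, indent=2)
restored = json.loads(as_json)
print(restored == articles)

# Index the articles by title to look one up later.
by_title = {article["title"]: article for article in articles}
lookup = by_title["Carbon Market Report: Emissions from EU ETS stationary installations fall by over 9%"]
print(lookup["date"])
```

To store the data on disk, `json.dump(articles, file)` writes the same structure to an open file.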

This format can be converted to a pandas data frame with `pd.DataFrame.from_records()`:

In [34]:
import pandas as pd

eu_df = pd.DataFrame.from_records(article_list)

In [35]:
eu_df.head()

| | title | link | date | summary | imgurl |
|---|---|---|---|---|---|
| 0 | Further action required to meet 2020 fuel qual... | /clima/news/further-action-required-meet-2020-... | 19/11/2020 | The Commission today adopted its2018 Fuel Qual... | https://ec.europa.eu/clima/sites/clima/files/s... |
| 1 | Carbon Market Report: Emissions from EU ETS st... | /clima/news/carbon-market-report-emissions-eu-... | 18/11/2020 | The European Commission has adopted its annual... | https://ec.europa.eu/clima/sites/clima/files/s... |
| 2 | Start of phase 4 of the EU ETS in 2021: adopti... | /clima/news/start-phase-4-eu-ets-2021-adoption... | 17/11/2020 | The Commission is finalising the preparations ... | https://ec.europa.eu/clima/sites/clima/files/s... |
| 3 | LIFE programme: over EUR 280 million in EU fun... | https://ec.europa.eu/commission/presscorner/de... | 16/11/2020 | The European Commission has approved an invest... | https://ec.europa.eu/clima/sites/clima/files/s... |
| 4 | Commission launches four public consultations ... | /clima/news/commission-launches-four-public-co... | 13/11/2020 | The European Commission today launched four op... | https://ec.europa.eu/clima/sites/clima/files/s... |
