# <center>Web Scraping</center>

<img src="../image/text_DALLE.jpeg" width=30% align="right" style="in-line">

>*Data is like garbage.*
>
>*You’d better know what you are going to do with it before you collect it.*
>
>— Mark Twain ? ( source: [Forbes](https://www.forbes.com/councils/forbestechcouncil/2023/05/09/the-delta-between-trust-and-usability-where-data-management-still-falls-short/) )

<img src="../image/quote1_ChatGPT.png" width=70% align="left" style="in-line">

## Agenda

1. Web page basics (see slides)
2. Web scraping with Python

<a name="2"></a>
## Agenda 2. Web scraping with Python

Sometimes web scraping is easy; other times it is complicated.

- Easy: static HTML
- Hard: HTML and CSS
- Harder: JavaScript (often requires a "headless" web browser)

Let's start the web scraping. We will collect some news regarding mobility and transport.

This is the website: [European Commission - Mobility and Transport News](https://transport.ec.europa.eu/news-events/news_en?page=0).

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font>Legal and ethical considerations** 
* **Terms of Use:** The European Commission allows the reuse of its content under certain conditions.
>Unless otherwise indicated (e.g. in individual copyright notices), content owned by the EU on this website is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. This means that reuse is allowed, provided appropriate credit is given and changes are indicated.

     For educational purposes, reuse is usually permitted. Review [the Legal Notice](https://commission.europa.eu/legal-notice_en) to ensure compliance.
     

* **Robots.txt:** Check [the website's robots.txt](https://transport.ec.europa.eu/robots.txt) file to see any disallowed paths. We can see that https://transport.ec.europa.eu/news-events is not among the disallowed paths.
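
The robots.txt check can also be done in code. Below is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `sample_rules` and the `admin_url` path are made up for illustration and are not the real contents of the European Commission's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real robots.txt will differ.
sample_rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # parse rules from a list of lines

news_url = "https://transport.ec.europa.eu/news-events/news_en?page=0"
admin_url = "https://transport.ec.europa.eu/admin/settings"  # hypothetical path

# can_fetch(user_agent, url) reports whether the rules allow the URL
print(parser.can_fetch("*", news_url))
print(parser.can_fetch("*", admin_url))
```

To check the live file instead of the sample rules, one option is to call `parser.set_url("https://transport.ec.europa.eu/robots.txt")` followed by `parser.read()` before `can_fetch`.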

We will use a library called `requests` to download web pages. `requests` makes a [GET request](https://en.wikipedia.org/wiki/HTTP#Request_methods) to a web server and returns the HTML content of the given web page. We will then use a library called `BeautifulSoup` to parse the HTML document.

&#x270A; **<font color=firebrick>DO THIS: </font> Run the cell below to check if you have the libraries installed. If not, install them now.**

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = "https://transport.ec.europa.eu/news-events/news_en?page=0"

In [None]:
page = requests.get(url, timeout=30)  # timeout avoids waiting forever on a stalled connection

In [None]:
print(page)

After running our request, we get a Response object. This object has a status code, which shows us if the page was downloaded successfully. A status code of 200 means that the page was downloaded successfully.

&#x1F4A1; **HTTP status codes** (Source: [Wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes))

>* 1xx informational response – the request was received, continuing process
>* 2xx successful – the request was successfully received, understood, and accepted
>* 3xx redirection – further action needs to be taken in order to complete the request
>* 4xx client error – the request contains bad syntax or cannot be fulfilled
>* 5xx server error – the server failed to fulfil an apparently valid request
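
To make the five classes above concrete, here is a small sketch that maps a numeric status code to its class using only the standard library. The helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus

# The five classes from the list above, keyed by the first digit of the code
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a short description such as '200 OK (successful)'."""
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "Not Found"
    except ValueError:
        phrase = "unknown"  # code is not a registered HTTP status
    status_class = STATUS_CLASSES.get(code // 100, "unknown class")
    return f"{code} {phrase} ({status_class})"

print(describe_status(200))
print(describe_status(404))
```

With a real response, the same helper could be called as `describe_status(page.status_code)`. `requests` also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx responses.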

We now use `BeautifulSoup` to parse the page.

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# page.content
# soup

In [None]:
# dir(soup)
# help(soup)

In [None]:
# print out the HTML content of the page
#print(soup)
print(soup.prettify()) #format the page nicely

The task now is to **locate the specific content that we want to scrape**. You can view the page structure in a browser (for example: in Chrome by clicking `View` -> `Developer` -> `Inspect Elements`).

Once you have located the content, look for the tag and attributes of the target element.

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font> HTML tags and attributes**

<img src="../image/HTML_element.png" width=50% align="left" >

There could be more than one way to locate a target element. 

&#x1F4A1; **HTML elements:** [documentation](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).
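
As a small illustration of tags and attributes, the sketch below parses a one-line HTML snippet and locates the same element in two ways. The snippet and its class names (`card`, `link`) are made up for this example and do not come from the website we are scraping.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration
html = '<div class="card"><a class="link" href="/news/item_en">Example title</a></div>'
demo_soup = BeautifulSoup(html, "html.parser")

# Two ways to reach the same element
link = demo_soup.find("a", class_="link")        # by tag name and class attribute
same_link = demo_soup.select_one("div.card a")   # by CSS selector

print(link.name)         # the tag name
print(link["href"])      # the value of the href attribute
print(link.get("class")) # the class attribute, returned as a list
print(link.get_text())   # the text between the opening and closing tags
```

Which approach is more convenient tends to depend on how the page is structured; CSS selectors are often shorter when the target is nested several levels deep.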

In [None]:
news = soup.find_all("div", class_="ecl-content-item-block__item")
#news = soup.find_all("article", class_="ecl-content-item")

In [None]:
#print(news)
len(news)

In [None]:
print(news[0]) # the first news

#copy paste it in a Markdown cell here

<div class="ecl-content-item-block__item contextual-region ecl-u-mb-l ecl-col-12"><article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2025-10-27T12:00:00Z">27 October 2025</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-events/news/solidarity-lanes-latest-figures-september-2025-2025-10-27_en">Solidarity Lanes: Latest figures – September 2025</a></div><div class="ecl-content-block__description"><p>Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lanes: new transport routes established in the face of Russia’s war of aggression against Ukraine.</p></div><ul class="ecl-content-block__secondary-meta-container"><li class="ecl-content-block__secondary-meta-item"><span class="wt-icon--clock ecl-icon ecl-icon--s ecl-content-block__secondary-meta-icon ecl-icon--clock"></span><span class="ecl-content-block__secondary-meta-label">3 min read</span></li></ul></div></article></div>

In [None]:
# a for loop to get all the titles
for item in news:
    title = item.find("a", class_="ecl-link ecl-link--standalone")
    if title is not None:  # skip items that have no title link
        print(title.get_text())
        print("====")

A note to myself: go back to the slides before the big task.

&#x270A; **<font color=firebrick>DO THIS: </font>** Here is an example project. We would like to find out what the European Union has done recently (let's say since 2024) to advance sustainable mobility and transport. One possible data source is the news we just scraped, but we need more information other than the title of the news.

So now please write some code to collect **the date, the title, the short description, the news type, and the link to the full text** of all news in 2024. Save the data to a **CSV** file.

Here are some tips:
1. How many pages do you need to scrape? Observe how the web addresses change between the first page and the second.
2. Remember that we talked about **avoiding overloading servers** in the ethics discussion. Make sure to use `time.sleep()`.
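
The two tips can be sketched as follows. This is only a starting point under stated assumptions: the helper names `page_url` and `fetch_pages` are hypothetical, and the sketch assumes that later pages follow the same `?page=N` pattern as the first one, which you should confirm in your browser. A stand-in function replaces the real download so that nothing is requested here.

```python
import time

BASE_URL = "https://transport.ec.europa.eu/news-events/news_en"

def page_url(page_number):
    """Build the listing URL for one page (assumes a zero-based `page` parameter)."""
    return f"{BASE_URL}?page={page_number}"

def fetch_pages(fetch, n_pages, delay_seconds=1.0):
    """Call fetch(url) for each page, pausing between consecutive requests."""
    results = []
    for number in range(n_pages):
        results.append(fetch(page_url(number)))
        if number < n_pages - 1:
            time.sleep(delay_seconds)  # be polite: wait before the next request
    return results

# Demo with a stand-in fetch function that just returns the URL it was given
urls = fetch_pages(lambda url: url, n_pages=3, delay_seconds=0)
print(urls)
```

In the notebook, `fetch` could be `requests.get`, with a delay of a second or more between pages.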

If you would like to challenge yourself, see if you can scrape the full text (not the short description) of the news. Trying this with one or two news items is enough.
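
For the CSV step, here is a minimal sketch using `csv.DictWriter`. The column names in `FIELDS`, the helper `write_rows`, and the example row are all hypothetical placeholders for whatever your scraper collects. The demo writes to an in-memory buffer so that no file is created.

```python
import csv
import io

# Hypothetical column names for the five pieces of information in the task
FIELDS = ["date", "title", "description", "type", "link"]

# Made-up row standing in for scraped results
rows = [
    {
        "date": "2024-01-15",
        "title": "Example title",
        "description": "Short text, with a comma",
        "type": "News article",
        "link": "/news/example_en",
    },
]

def write_rows(file_obj, rows):
    """Write a header line followed by one CSV line per dictionary."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_rows(buffer, rows)
print(buffer.getvalue())
```

To write a real file, the buffer could be replaced by something like `open("news.csv", "w", newline="", encoding="utf-8")`, where the file name is only an example. The `csv` module takes care of quoting values that contain commas.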

In [None]:
# the extra packages you will need
import time
import csv

In [None]:
# put your code here














---------
### Congratulations, we are done!

This notebook is written by [Meng Cai](https://www.linkedin.com/in/caimeng2/), Technical University of Darmstadt. Special thanks to [Dirk Colbry](https://www.linkedin.com/in/dirkcolbry/) for sharing his course materials on this topic. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a>