# <center>Web Scraping</center>

<img src="../image/text_DALLE.jpeg" width=30% align="right" style="in-line">

>*Data is like garbage.*
>
>*You’d better know what you are going to do with it before you collect it.*
>
>— Mark Twain ? ( source: [Forbes](https://www.forbes.com/councils/forbestechcouncil/2023/05/09/the-delta-between-trust-and-usability-where-data-management-still-falls-short/) )

<img src="../image/quote1_ChatGPT.png" width=70% align="left" style="in-line">

## Agenda

1. Web page basics (see slides)
2. Web scraping with Python

<a name="2"></a>
## Agneda 2. Web scraping with Python

Sometimes webs scraping can be really easy, other times it can be complicated. 

- Easy: static HTML
- Hard: HTML and CSS
- Harder: Javascript - Often requires a "Headless" web browser

Let's start the web scraping. We will collect some news regarding mobility and transport.

This is the website: [European Commission - Mobility and Transport News](https://transport.ec.europa.eu/news-events/news_en?page=0).

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font>Legal and ethical considerations** 
* **Terms of Use:** The European Commission allows the reuse of its content under certain conditions.
>Unless otherwise indicated (e.g. in individual copyright notices), content owned by the EU on this website is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) licence. This means that reuse is allowed, provided appropriate credit is given and changes are indicated.

     For educational purposes, reuse is usually permitted. Review [the Legal Notice](https://commission.europa.eu/legal-notice_en) to ensure compliance.
     

* **Robots.txt:** Check [the website's robots.txt](https://transport.ec.europa.eu/robots.txt) file to see any disallowed paths. We can see that https://transport.ec.europa.eu/news-events is not among the disallowed paths.

We will use a library called `requests` to  download web pages. The `requests` will make a [GET request](https://en.wikipedia.org/wiki/HTTP#Request_methods) to a web server, which will download the HTML contents of a given web page for us. And we will use a library called `BeautifulSoup` to parse the HTML document.

&#x270A; **<font color=firebrick>DO THIS: </font> Run the cell below to check if you have the libraries installed. If not, install them now.**

In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
url = "https://transport.ec.europa.eu/news-events/news_en?page=0"

In [4]:
page = requests.get(url)

In [5]:
print(page)

<Response [200]>


After running our request, we get a Response object. This object has a status code, which shows us if the page was downloaded successfully. A status code of 200 means that the page was downloaded successfully.

&#x1F4A1; **HTTP status codes** (Source: [wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes))

>* 1xx informational response – the request was received, continuing process
>* 2xx successful – the request was successfully received, understood, and accepted
>* 3xx redirection – further action needs to be taken in order to complete the request
>* 4xx client error – the request contains bad syntax or cannot be fulfilled
>* 5xx server error – the server failed to fulfil an apparently valid request

We now use `BeautifulSoup` to parse the page.

In [6]:
soup = BeautifulSoup(page.content, 'html.parser')

In [7]:
# page.content
# soup

In [8]:
# dir(soup)
# help(soup)

In [9]:
# print out the HTML content of the page
#print(soup)
print(soup.prettify()) #format the page nicely

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="og: https://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="News" name="description"/>
  <link href="https://transport.ec.europa.eu/news-events/news_en" rel="canonical"/>
  <meta content="follow, noindex" name="robots"/>
  <meta content="auto" property="og:determiner"/>
  <meta content="Mobility and Transport" property="og:site_name"/>
  <meta content="website" property="og:type"/>
  <meta content="https://transport.ec.europa.eu/news-events/news_en" property="og:url"/>
  <meta content="News" property="og:title"/>
  <meta content="News" property="og:description"/>
  <meta content="https://transport.ec.europa.eu/profiles/contrib/ewcms/modules/ewcms_seo/assets/images/ec-socialmedia-fallback.png" property="og:image"/>
  <meta content="Mobility and Transport" property="og:image:alt"/>
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="News" name="twitter:title"/>
  <meta content="News" name="twitter:

The task now is to **locate the specific content that we want to scrape**. You can view the page structure in a browser (for example: in Chrome by clicking `View` -> `Developer` -> `Inspect Elements`).

Once locate the content, look for the tag and attribute of the target element.

&#x1F4D6; **<font color=teal>WHAT WE HAVE LEARNED: </font> HTML tags and attributes**

<img src="../image/HTML_element.png" width=50% align="left" >

There could be more than one way to locate a target element. 

&#x1F4A1; **HTML elements:** [documentation](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

In [10]:
news = soup.find_all("div", class_="ecl-content-item-block__item")
#news = soup.find_all("article", class_="ecl-content-item")

In [11]:
#print(news)
len(news)

10

In [12]:
print(news[0]) # the first news

<div class="ecl-content-item-block__item contextual-region ecl-u-mb-l ecl-col-12"><article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2025-10-27T12:00:00Z">27 October 2025</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-events/news/solidarity-lanes-latest-figures-september-2025-2025-10-27_en">Solidarity Lanes: Latest figures – September 2025</a></div><div class="ecl-content-block__description"><p>Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lanes: new transport routes established in the face of Russia’s war of aggression against Ukraine.</p></div><ul class="ecl-content-block__secondary-me

#copy paste it in a Markdown cell here

<div class="ecl-content-item-block__item contextual-region ecl-u-mb-l ecl-col-12"><article class="ecl-content-item"><div class="ecl-content-block ecl-content-item__content-block" data-ecl-auto-init="ContentBlock" data-ecl-content-block=""><ul class="ecl-content-block__primary-meta-container"><li class="ecl-content-block__primary-meta-item">News article</li><li class="ecl-content-block__primary-meta-item"><time datetime="2025-10-27T12:00:00Z">27 October 2025</time></li></ul><div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-events/news/solidarity-lanes-latest-figures-september-2025-2025-10-27_en">Solidarity Lanes: Latest figures – September 2025</a></div><div class="ecl-content-block__description"><p>Latest figures on Ukrainian exports and imports via the EU-Ukraine Solidarity Lanes: new transport routes established in the face of Russia’s war of aggression against Ukraine.</p></div><ul class="ecl-content-block__secondary-meta-container"><li class="ecl-content-block__secondary-meta-item"><span class="wt-icon--clock ecl-icon ecl-icon--s ecl-content-block__secondary-meta-icon ecl-icon--clock"></span><span class="ecl-content-block__secondary-meta-label">3 min read</span></li></ul></div></article></div>

In [13]:
# a for loop to get all the titles
for item in news:
    title = item.find("a", class_="ecl-link ecl-link--standalone")
    print(title.get_text())
    print("====")

Solidarity Lanes: Latest figures – September 2025
====
New report shows progress in sustainable aviation fuel uptake across the EU
====
Road Safety Statistics for 2024: Progress continues amid persistent challenges  
====
Young people and transport jobs: making transport careers more attractive
====
The new Entry/Exit System went live on 12 October
====
Statement ahead of the IMO’s Marine Environment Protection Committee (MEPC) meeting
====
EU and Azerbaijan deepen transport cooperation at second transport dialogue 
====
October infringements package: key decisions
====
New Performance Review Board appointed to drive implementation of the Single European Sky
====
EU welcomes UN aviation agency’s condemnation of Russia for undermining global aviation safety
====


A note to myself: go back to the slides before the big task.

&#x270A; **<font color=firebrick>DO THIS: </font>** Here is an example project. We would like to find out what the European Union has done recently (let's say since 2024) to advance sustainable mobility and transport. One possible data source is the news we just scraped, but we need more information other than the title of the news.

So now please write some code to collect **the date, the title, the short description, the news type, and the link to the full text** of all news in 2024. Save the data to a **csv** file.

Here are some tips:
1. How many pages do you need to scrape? Observe how the web addresses change between the first page and the second.
2. Remember we have talked about **avoid overloading servers** in ethics. Make sure to use `time.sleep()`.

If you would like to challenge yourself, see if you can scrape the full text (not the short description) of the news. Try with one or two pieces of news would be enough.

In [14]:
# the extra packages you will need
import time
import csv

In [15]:
import time
import csv
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import os

def scrape_news():
    # Base URL for the news pages
    base_url = "https://transport.ec.europa.eu/news-events/news_en"
    news_data = []
    seen_urls = set()
    
    # Start with page 0
    current_page = 0
    empty_pages_count = 0
    max_pages = 50
    
    print("Starting news collection...")
    
    while current_page < max_pages:
        current_url = f"{base_url}?page={current_page}"
        print(f"\nFetching page {current_page}...")
        
        # Be nice to the server
        time.sleep(5)
        
        try:
            response = requests.get(current_url)
            response.raise_for_status()
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Find all news articles on the page
            articles = soup.find_all("div", class_="ecl-content-item-block__item")
            
            if not articles:
                print(f"No articles found on page {current_page}. Stopping.")
                break
                
            articles_from_2024 = 0
            
            for article in articles:
                # Get title and link
                title_elem = article.find("a", class_="ecl-link ecl-link--standalone")
                if not title_elem:
                    continue
                    
                link = "https://transport.ec.europa.eu" + title_elem['href']
                if link in seen_urls:
                    continue
                    
                # Get date
                date_elem = article.find("time")
                if not date_elem:
                    continue
                    
                try:
                    date = datetime.strptime(date_elem.get_text().strip(), "%d %B %Y")
                    if date.year != 2024:
                        continue
                except (ValueError, AttributeError):
                    continue
                
                # If we got here, we have a valid 2024 article we haven't seen before
                title = title_elem.get_text().strip()
                
                # Get type
                type_elem = article.find("li", class_="ecl-content-block__primary-meta-item")
                news_type = type_elem.get_text().strip() if type_elem else ""
                
                # Get description
                desc_elem = article.find("div", class_="ecl-content-block__description")
                description = desc_elem.get_text().strip() if desc_elem else ""
                
                # Add to our collection
                news_data.append({
                    'date': date,
                    'title': title,
                    'type': news_type,
                    'description': description,
                    'link': link
                })
                seen_urls.add(link)
                articles_from_2024 += 1
                print(f"Found 2024 article: {title}")
            
            # Update empty pages counter
            if articles_from_2024 == 0:
                empty_pages_count += 1
                print(f"No 2024 articles on page {current_page}. ({empty_pages_count} empty pages in a row)")
                if empty_pages_count >= 3:
                    print("\nNo new 2024 articles found in 3 consecutive pages. Stopping.")
                    break
            else:
                empty_pages_count = 0
                print(f"Found {articles_from_2024} new 2024 articles on this page")
                print(f"Total unique articles collected: {len(news_data)}")
            
            # Move to next page
            current_page += 1
            
        except requests.RequestException as e:
            print(f"Error accessing page {current_page}: {e}")
            break
    
    return news_data

# Run the scraper
print("Starting web scraping...")
collected_news = scrape_news()

# Save results to CSV
if collected_news:
    csv_filename = 'transport_news_2024.csv'
    fieldnames = ['date', 'title', 'type', 'description', 'link']
    
    with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for item in collected_news:
            item['date'] = item['date'].strftime("%Y-%m-%d")
            writer.writerow(item)
    
    print(f"\nSuccessfully saved {len(collected_news)} unique articles to {csv_filename}")
    print("File location:", os.path.abspath(csv_filename))
else:
    print("\nNo articles from 2024 were found.")

Starting web scraping...
Starting news collection...

Fetching page 0...
No 2024 articles on page 0. (1 empty pages in a row)

Fetching page 1...
No 2024 articles on page 0. (1 empty pages in a row)

Fetching page 1...
No 2024 articles on page 1. (2 empty pages in a row)

Fetching page 2...
No 2024 articles on page 1. (2 empty pages in a row)

Fetching page 2...
No 2024 articles on page 2. (3 empty pages in a row)

No new 2024 articles found in 3 consecutive pages. Stopping.

No articles from 2024 were found.
No 2024 articles on page 2. (3 empty pages in a row)

No new 2024 articles found in 3 consecutive pages. Stopping.

No articles from 2024 were found.


In [None]:
def scrape_full_article(url):
    """
    Scrape the full text content of a news article with improved formatting
    """
    # Be nice to the server
    time.sleep(3)
    
    try:
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # First find the main content area
        main = soup.find('main')
        if not main:
            return "Main content area not found"
            
        # Look for content within the ecl container in main
        content_container = main.find("div", class_="ecl-container")
        if not content_container:
            return "Content container not found"
        
        # First get the title
        title = soup.find("h1")
        article_title = title.get_text().strip() if title else ""
        
        # Get metadata (date, type)
        meta_items = soup.find_all("li", class_="ecl-content-block__primary-meta-item")
        article_type = ""
        article_date = ""
        for item in meta_items:
            text = item.get_text().strip()
            if "article" in text.lower():
                article_type = text
            elif any(month in text for month in ['January', 'February', 'March', 'April', 'May', 'June', 
                                               'July', 'August', 'September', 'October', 'November', 'December']):
                article_date = text
        
        # Get all meaningful text content, excluding navigation and footer elements
        content_elements = content_container.find_all(['p', 'h2', 'h3', 'h4', 'ul', 'ol', 'li'])
        
        # Filter and clean the text, preserving structure
        sections = []
        current_section = []
        
        for element in content_elements:
            # Skip elements that are part of navigation or other UI components
            if element.find_parent(class_=['ecl-menu', 'ecl-site-header', 'ecl-site-footer']):
                continue
            
            text = element.get_text().strip()
            if not text or text.startswith(('Share this page:', 'Related topics:', 'Contact')):
                continue
                
            # If we hit a header, start a new section
            if element.name in ['h2', 'h3', 'h4']:
                if current_section:
                    sections.append('\n'.join(current_section))
                    current_section = []
                current_section.append(f"\n## {text}")
            elif element.name in ['ul', 'ol']:
                # Handle lists
                items = element.find_all('li')
                current_section.append('')  # Add spacing before list
                for item in items:
                    item_text = item.get_text().strip()
                    if item_text:
                        current_section.append(f"• {item_text}")
                current_section.append('')  # Add spacing after list
            else:
                current_section.append(text)
        
        # Add the last section if it exists
        if current_section:
            sections.append('\n'.join(current_section))
        
        # Construct the final formatted text
        formatted_parts = []
        if article_title:
            formatted_parts.append(f"# {article_title}")
        if article_type or article_date:
            meta = []
            if article_type:
                meta.append(article_type)
            if article_date:
                meta.append(article_date)
            formatted_parts.append(f"{' | '.join(meta)}\n")
        
        formatted_parts.extend(sections)
        
        full_text = '\n\n'.join(formatted_parts)
        return full_text if full_text else "No content found in the main area"
            
    except requests.RequestException as e:
        return f"Error fetching article: {e}"

# Get the saved news data from our CSV file
try:
    news_data = []
    with open('transport_news_2024.csv', 'r', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        news_data = list(reader)
    
    # Get the full text of the first two articles
    print("Fetching full text for the first two articles...")
    for i, article in enumerate(news_data[:2]):
        print(f"\n{'='*80}")  # Clear section separator
        print(f"Article {i+1}:")
        print(f"URL: {article['link']}")
        print('='*80)
        full_text = scrape_full_article(article['link'])
        print(full_text)
        print('='*80)
        
except FileNotFoundError:
    print("Please run the previous cell first to collect and save the news data.")

Fetching full text for the first two articles...

Processing article 1: Investments in climate adaptation should be an integral part of the trans-European transport network, study shows
URL: https://transport.ec.europa.eu/news-events/news/investments-climate-adaptation-should-be-integral-part-trans-european-transport-network-study-shows-2024-12-19_en

Full text:
--------------------------------------------------
A completed and climate-resilient trans-European transport network (TEN-T) is a cornerstone for growth, and both socio-economic and territorial cohesion in the EU. According to a recent European Commission study, integrating climate adaptation into TEN-T policies is essential to meet adaptation goals.

The impacts of climate change are already having tremendous repercussions in Europe. On several occasions in 2024, extreme weather events rendered transportation systems unusable, destroying infrastructure and disrupting supply chains for long periods.

The study warns that all T

In [21]:
# Let's look at the HTML structure of an article page
def inspect_article_page(url):
    print(f"Inspecting URL: {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Print all div classes to find the right container
    print("\nAll div classes found:")
    for div in soup.find_all('div', class_=True):
        print(f"- {div.get('class')}")
        
    # Let's look specifically for the main content area
    print("\nLooking for content sections:")
    main = soup.find('main')
    if main:
        print("Found <main> tag")
        for div in main.find_all('div', class_=True):
            print(f"- {div.get('class')}")

# Get an article URL from our CSV
try:
    with open('transport_news_2024.csv', 'r', encoding='utf-8') as csvfile:
        reader = csv.DictReader(csvfile)
        first_article = next(reader)
        inspect_article_page(first_article['link'])
except FileNotFoundError:
    print("Please run the main scraping cell first to generate the CSV file.")

Inspecting URL: https://transport.ec.europa.eu/news-events/news/investments-climate-adaptation-should-be-integral-part-trans-european-transport-network-study-shows-2024-12-19_en

All div classes found:
- ['dialog-off-canvas-main-canvas']
- ['ecl-site-header__header']
- ['ecl-site-header__inner']
- ['ecl-site-header__background']
- ['ecl-site-header__header']
- ['ecl-site-header__container', 'ecl-container']
- ['ecl-site-header__top']
- ['ecl-site-header__action']
- ['ecl-site-header__language']
- ['ecl-site-header__language-container']
- ['ecl-site-header__language-header']
- ['ecl-site-header__language-title']
- ['ecl-site-header__language-content']
- ['ecl-site-header__language-category']
- ['ecl-site-header__language-category-title']
- ['ecl-site-header__language-category']
- ['ecl-site-header__language-category-title']
- ['ecl-site-header__search-container']
- ['ecl-form-group']
- ['ecl-lang-select-page', 'ecl-u-pv-m', 'ecl-u-d-none']
- ['ecl-container']
- ['webtools-etrans--wrappe

---------
### Congratulations, we are done!

This notebook is written by [Meng Cai](https://www.linkedin.com/in/caimeng2/), Technical University of Darmstadt. Special thanks to [Dirk Colbry](https://www.linkedin.com/in/dirkcolbry/) for sharing his course materials on this topic. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a>