# Data Reading

In this step, you will read the CSV file from the provided GitHub URL into a pandas DataFrame, which makes the data easy to manipulate and analyze in the steps that follow.

In [0]:
import pandas as pd

url = "https://raw.githubusercontent.com/Tao-Pi/CAS-Applied-Data-Science/refs/heads/main/DSF5%20-%20copy/01_Module%20Final%20Assignment/Top-10000-posts-by-page-views-Sep-01-2025-Sep-30-2025-eng-swissinfo-ch-any.csv"
df = pd.read_csv(url)
display(df)
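
Before scraping, it can help to confirm that the file loaded as expected. The following is a minimal sketch that uses a tiny inline CSV as a stand-in for the real file. Only the `URL` column name is taken from this notebook; `Title` and `Page views` are hypothetical names used for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; "Title" and "Page views" are hypothetical column names.
sample_csv = io.StringIO(
    "Title,URL,Page views\n"
    "Example A,https://example.com/a,120\n"
    "Example B,https://example.com/b,95\n"
)
sample_df = pd.read_csv(sample_csv)

# The scraping step reads df["URL"], so check that the column exists
# and has no missing values before sending any requests.
has_url_column = "URL" in sample_df.columns
missing_urls = int(sample_df["URL"].isna().sum())
print(sample_df.shape, has_url_column, missing_urls)
```

The same two checks can be run on `df` itself once the real file has loaded.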

# Extracting Article Content with BeautifulSoup

In this step, we will use `requests` and the BeautifulSoup library to fetch each URL in our DataFrame and extract three components from the article page: the header, the lead (subtitle or introduction), and the main article text. For each URL, we send an HTTP request, parse the returned HTML, and select the relevant elements. The requests run in parallel in a thread pool, and the extracted text is added to the DataFrame as new columns for further analysis.
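
To see how the selection works without any network access, here is a small sketch that runs the same lookups on an inline HTML string. The class names and the CSS selector are the ones used in this notebook; the HTML itself is a hypothetical, heavily simplified page and not the real swissinfo.ch markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page used only to illustrate the lookups.
sample_html = """
<div id="main-content"><main><article>
  <h1 class="article-header">Sample <em>headline</em></h1>
  <p class="lead-text">A short lead.</p>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</article></main></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find(class_=...) returns the first matching element, or None if nothing matches.
header = soup.find(class_="article-header")
header_text = header.get_text(strip=True, separator=" ") if header else None

# select_one takes a CSS selector; find_all("p") collects every <p> below the match.
article = soup.select_one("#main-content main article")
paragraphs = [p.get_text(strip=True) for p in article.find_all("p")] if article else []

missing = soup.find(class_="does-not-exist")
print(header_text, paragraphs, missing)
```

In this sample the lead paragraph sits inside `<article>`, so it also appears in `paragraphs`. Whether that happens on the real pages depends on their markup.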

In [0]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def extract_swi_article(url):
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        # Header (title)
        header = soup.find(class_="article-header")
        header_text = header.get_text(strip=True, separator=" ") if header else None

        # Lead (subtitle or introduction)
        lead = soup.find(class_="lead-text")
        lead_text = lead.get_text(strip=True, separator=" ") if lead else None

        # Main text (paragraphs in the article)
        article = soup.select_one("#main-content main article")
        if article:
            paragraphs = [p.get_text(strip=True) for p in article.find_all("p")]
            article_text = "\n".join(paragraphs)
        else:
            article_text = None

        return {
            "header_text": header_text,
            "lead_text": lead_text,
            "article_text": article_text
        }
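    # On any failure (timeout, HTTP error, parsing problem), keep the row
    # and record the error message in article_text instead of raising.
    except Exception as e: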
        return {
            "header_text": None,
            "lead_text": None,
            "article_text": f"Error: {e}"
        }

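urls = df["URL"].tolist()
# Fetch the pages concurrently. executor.map returns results in input order,
# so results[i] belongs to urls[i].
with ThreadPoolExecutor(max_workers=16) as executor:
    results = list(executor.map(extract_swi_article, urls))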

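# Reuse df's index so the column assignment aligns row by row
# even if df does not have a default RangeIndex.
results_df = pd.DataFrame(results, index=df.index)
df[["header_text", "lead_text", "article_text"]] = results_df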

display(df[["URL", "header_text", "lead_text", "article_text"]])
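
Because `extract_swi_article` writes `"Error: ..."` into `article_text` when a request fails, it is worth counting those rows before analyzing the text. The following is a rough sketch on a hand-made DataFrame; the sample values are illustrative only and do not come from the real data.

```python
import pandas as pd

# Hand-made stand-in for the scraped results; the values are illustrative only.
sample = pd.DataFrame({
    "URL": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "article_text": ["Some text", "Error: 404 Client Error", None],
})

# Rows where the request or parsing failed carry the "Error:" prefix.
failed = sample["article_text"].str.startswith("Error:", na=False)
# None means the page loaded but the article selector matched nothing.
empty = sample["article_text"].isna()
ok = ~failed & ~empty

print(f"failed: {failed.sum()}, no article found: {empty.sum()}, ok: {ok.sum()}")
```

Note that this prefix check would also flag a genuine article whose text happens to start with "Error:". A separate error column would avoid that ambiguity.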