# Scraping Open Data Documents from Rijksoverheid

This notebook scrapes and processes ICT-related documents from the Dutch government's open data portal. Instructions for accessing these documents are available on the [Rijksoverheid (Dutch government) website](https://www.rijksoverheid.nl/opendata/documenten). The goal is to compile the documents, including their metadata and content, into a single CSV file.

### Importing libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm
import fitz  # Import PyMuPDF
import os
import io
from pathlib import Path

### Fetching the list of documents

In [2]:
# create directory
Path("../Data/Rijksoverheid").mkdir(parents=True, exist_ok=True)

In [3]:
# Function to fetch documents
def fetch_documents(subject, initial_date, offset, rows):
    """
    Fetches documents from the Rijksoverheid API based on the specified parameters.
    :param subject: the subject of the documents to fetch
    :param initial_date: the initial date from which to fetch documents
    :param offset: the offset to start fetching documents from
    :param rows: the number of rows to fetch
    :return: the XML response text if successful, None otherwise
    """
    base_url = "https://opendata.rijksoverheid.nl/v1/documents"
    params = {
        "subject": subject,
        "initialdatesince": initial_date,
        "offset": offset,
        "rows": rows
    }
    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(base_url, params=params, timeout=30)
    if response.status_code == 200:
        return response.text
    return None
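
To make the request concrete, the sketch below shows the query string that the parameters in `fetch_documents` would roughly translate to. It uses only the standard library's `urlencode` rather than `requests`, and `build_query_url` is a hypothetical helper introduced for illustration, not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://opendata.rijksoverheid.nl/v1/documents"


def build_query_url(subject, initial_date, offset, rows):
    """Hypothetical helper: return the URL a GET with these parameters would hit."""
    params = {
        "subject": subject,
        "initialdatesince": initial_date,
        "offset": offset,
        "rows": rows,
    }
    # urlencode joins the key/value pairs in insertion order
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("ict", "20200101", 1, 200)
print(url)
```

Printing the URL this way can be handy for pasting a request into a browser to inspect the raw XML before running the full scraper.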

### Parsing the XML response

In [4]:
# Function to parse XML and extract document metadata
def parse_xml(xml_data):
    """
    Parses the XML data and extracts the metadata for each document.
    :param xml_data: XML data to parse
    :return: a list of dictionaries containing the metadata for each document
    """
    documents = []
    root = ET.fromstring(xml_data)
    for doc in root.findall('document'):
        metadata = {
            # findtext returns None instead of raising when an element is missing
            "id": doc.findtext('id'),
            "type": doc.findtext('type'),
            "title": doc.findtext('title'),
            "canonical": doc.findtext('canonical'),
            "introduction": "",
            "lastmodified": doc.findtext('lastmodified'),
            "available": doc.findtext('available'),
            "initialdate": doc.findtext('initialdate'),
        }
        
        # The introduction is an HTML fragment; keep only the text inside its <p> tags
        intro_html = doc.findtext('introduction')
        if intro_html:
            intro_soup = BeautifulSoup(intro_html, 'html.parser')
            paragraphs = [p.get_text() for p in intro_soup.find_all('p')]
            metadata['introduction'] = " ".join(paragraphs).strip()
        
        documents.append(metadata)
    return documents
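
The parsing step can be tried without touching the network. The sketch below runs the same `findall('document')` pattern on a small hand-written XML sample; the sample is an assumption about the response shape (a root element with `<document>` children), not a captured API response, and `extract_fields` is a hypothetical helper. It uses `findtext` with a default so that a missing element yields an empty string rather than an `AttributeError`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the real response may contain more fields
SAMPLE_XML = """<documents>
  <document>
    <id>example-1</id>
    <type>kamerstuk</type>
    <title>Example title</title>
    <introduction>&lt;p&gt;First paragraph.&lt;/p&gt;</introduction>
  </document>
  <document>
    <id>example-2</id>
    <title>No type or introduction here</title>
  </document>
</documents>"""


def extract_fields(xml_data, fields):
    """Hypothetical helper: one dict per <document>, '' for missing fields."""
    root = ET.fromstring(xml_data)
    return [
        {field: doc.findtext(field, default="") for field in fields}
        for doc in root.findall("document")
    ]


records = extract_fields(SAMPLE_XML, ["id", "type", "title", "introduction"])
for record in records:
    print(record)
```

Note that the escaped introduction comes back as an HTML string, which is why the notebook passes it through BeautifulSoup afterwards.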

### Downloading and extracting text from the PDFs

In [5]:
def download_pdf(url):
    """
    Downloads a PDF file from the specified URL and returns the path to the downloaded file.
    :param url: the URL of the PDF file to download
    :return: the path to the downloaded PDF file if successful, None otherwise
    """
    # Temporary file path (the same file is reused for every download)
    temp_file_path = "../Data/Rijksoverheid/temp.pdf"
    try:
        # Make a request to the PDF URL
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        
        # Save the PDF to a temporary file
        with open(temp_file_path, 'wb') as temp_pdf:
            temp_pdf.write(response.content)
        return temp_file_path
    except Exception:
        # Download or write failed; the caller falls back to the HTML text
        return None
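
A link in the intro block is not guaranteed to point at a PDF. One possible safeguard, which is not part of the original notebook, is to check the leading bytes of the payload before writing it to disk: PDF files begin with the signature `%PDF-`. The helper name `looks_like_pdf` is hypothetical.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header signature."""
    # Tolerate leading whitespace, which some servers prepend
    return data.lstrip().startswith(PDF_MAGIC)


print(looks_like_pdf(b"%PDF-1.7\n%rest of file"))
print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))
```

In `download_pdf`, such a check could be applied to `response.content` and the function could return `None` when it fails, so that the HTML fallback is used instead.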

In [6]:
# Function to extract text from a downloaded PDF and delete the temporary file
def extract_text_from_pdf(pdf_path, url):
    """
    Extracts text from a PDF.
    :param pdf_path: the path to the PDF file
    :param url: the URL of the PDF file for debugging purposes
    :return: the extracted text from the PDF
    """
    text = ""
    try:
        # Attempt to open and read the PDF
        with fitz.open(pdf_path) as doc:
            for page in doc:    
                text += page.get_text()
            
            text = clean_text(text)            
                
    except Exception as e:
        print(f"An error occurred while extracting text from url {url}: {e}")
        text = ""
    
    # Ensure the temporary PDF file is deleted
    if os.path.exists(pdf_path):
        os.remove(pdf_path)
        
    return text
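
The cleanup step above can also be written with `try`/`finally` and a uniquely named temporary file, which avoids reusing one fixed path. This is a sketch of that pattern using only the standard library; `process_with_temp_file` and `file_size` are hypothetical names, and `reader` stands in for whatever function consumes the file (in the notebook, the PyMuPDF extraction).

```python
import os
import tempfile


def process_with_temp_file(payload: bytes, reader):
    """Write payload to a unique temp file, pass its path to reader, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        return reader(path)
    finally:
        # Runs whether reader returned normally or raised
        if os.path.exists(path):
            os.remove(path)


def file_size(path):
    """Stand-in reader: report how many bytes were written."""
    return os.path.getsize(path)


size = process_with_temp_file(b"example bytes", file_size)
print(size)
```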

In [7]:
def clean_text(text):
    """
    Cleans the extracted text from a PDF. Replaces newlines, carriage returns, and tabs with spaces. Attempts to find the start of the content by searching (case-sensitively) for the words 'Aanleiding' ('occasion') or 'Geacht' ('Dear').
    :param text: the raw extracted text from the PDF
    :return: the cleaned text from the PDF
    """
    # Strip the text of newlines, carriage returns, and tabs
    text = text.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').strip()
    
    # Find the first occurrence of 'Aanleiding' and 'Geacht' (case-sensitive; -1 if absent)
    start_idx1 = text.find('Aanleiding')
    start_idx2 = text.find('Geacht')
    
    # Keep only the keywords that were actually found
    start_indices = [i for i in [start_idx1, start_idx2] if i != -1]
    # Trim to the earliest keyword if it occurs within the first 1000 characters
    if start_indices and min(start_indices) < 1000:
        start_idx = min(start_indices)
        text = text[start_idx:]
    else: # Otherwise fall back to a fixed default start position (character 70)
        text = text[70:]
        
    return text
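
To see the heuristic in action, here is a standalone restatement applied to two made-up inputs. `trim_to_start` is a hypothetical name, the thresholds are exposed as parameters only for illustration, and the sample letter is invented.

```python
def trim_to_start(text, keywords=("Aanleiding", "Geacht"), max_start=1000, fallback=70):
    """Standalone restatement of the start-of-content heuristic."""
    text = text.replace("\n", " ").replace("\r", " ").replace("\t", " ").strip()
    # Positions of the keywords that occur in the text
    hits = [i for i in (text.find(k) for k in keywords) if i != -1]
    if hits and min(hits) < max_start:
        return text[min(hits):]
    # No usable keyword: drop a fixed-length prefix instead
    return text[fallback:]


letter = "Ministerie van Voorbeeld\nPostbus 0000\nGeachte voorzitter,\nHierbij de brief."
print(trim_to_start(letter))  # → Geachte voorzitter, Hierbij de brief.

no_keyword = "x" * 100
print(len(trim_to_start(no_keyword)))  # → 30
```

Because the search is case-sensitive and the fallback slices at a fixed position, a text shorter than 70 characters without a capitalized keyword comes back empty.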

### Scraping the content

In [8]:
# Function to scrape a document page, preferring a linked PDF when one is available
def scrape_content(url):
    """
    Scrapes the content from the URL. Often the content is in a PDF file, so we attempt to download and extract the text from the PDF.
    :param url: the URL to scrape content from
    :return: the scraped content if successful, an empty string otherwise
    """
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        intro_div = soup.find('div', class_='intro')
        # Try to find a PDF link in the introduction div; otherwise fall back to the page text
        if intro_div:
            link = intro_div.find('a', href=True)
            pdf_link = link['href'] if link else None
            if pdf_link:
                pdf_path = download_pdf(pdf_link)
                if pdf_path:
                    text = extract_text_from_pdf(pdf_path, url)
                    return text
        # If no PDF link is found or the download failed, collect the text of all paragraphs, headings, and list items
        text = []
        for element in soup.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li']):
            text.append(element.get_text(strip=True))
        text = " ".join(text).strip()
        return text
            
    return ""
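
The first link inside the intro block may be relative, or may not be a PDF at all. The sketch below shows one way to normalize and filter such links with the standard library before attempting a download. The helper `resolve_pdf_link`, the example paths, and the assumption that PDF links end in `.pdf` are all illustrative and not confirmed by the notebook.

```python
from urllib.parse import urljoin, urlparse


def resolve_pdf_link(page_url, href):
    """Return an absolute URL for href, or None if it does not look like a PDF link."""
    # urljoin leaves absolute URLs untouched and resolves relative ones against the page
    absolute = urljoin(page_url, href)
    # Compare the path only, so a query string does not hide the extension
    if urlparse(absolute).path.lower().endswith(".pdf"):
        return absolute
    return None


page = "https://www.rijksoverheid.nl/documenten/kamerstukken/2020/01/01/voorbeeld"
print(resolve_pdf_link(page, "/binaries/voorbeeld.pdf"))
print(resolve_pdf_link(page, "https://example.org/files/report.PDF?download=1"))
print(resolve_pdf_link(page, "/onderwerpen/ict"))
```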

### Fetching and processing documents

In [9]:
# Main function: fetch all documents page by page and append each batch to a CSV file
def fetch_and_save_documents(subject, initial_date, csv_file_path):
    """
    Fetches and processes documents from the Rijksoverheid API based on the specified parameters and saves them to a CSV file.
    :param subject: the subject of the documents to fetch (see https://www.rijksoverheid.nl/opendata/documenten)
    :param initial_date: the initial date from which to fetch documents (format: YYYYMMDD)
    :param csv_file_path: the path to save the CSV file
    :return: None
    """
    offset = 1
    rows = 200
    file_exists = os.path.exists(csv_file_path)
    
    while True:
        xml_data = fetch_documents(subject, initial_date, offset, rows)
        if xml_data:
            documents = parse_xml(xml_data)
            if not documents:
                break
            batch_documents = []
            for doc in tqdm(documents, desc=f"Processing documents {offset} to {offset + len(documents) - 1}"):
                doc['content'] = scrape_content(doc['canonical'])
                batch_documents.append(doc)
                
            # Convert the batch documents to a DataFrame
            batch_df = pd.DataFrame(batch_documents)
            # Append to the CSV file
            if not file_exists:
                batch_df.to_csv(csv_file_path, mode='w', header=True, index=False)
                file_exists = True # Ensure header is written only once
            else:
                batch_df.to_csv(csv_file_path, mode='a', header=False, index=False)
            
            offset += rows
        else:
            break
    print("All documents processed and saved to CSV file.")
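
The paging logic can be isolated from the network and the CSV writing. The sketch below runs the same offset loop against an in-memory stand-in for the API. `paginate`, `fake_fetch`, and `FAKE_RECORDS` are hypothetical, and the 1-based offset simply mirrors the notebook's `offset = 1`; whether the real API counts from 0 or 1 is not verified here.

```python
def paginate(fetch_page, rows=200, start_offset=1):
    """Yield (offset, page) pairs until fetch_page returns an empty page."""
    offset = start_offset
    while True:
        page = fetch_page(offset, rows)
        if not page:
            break
        yield offset, page
        offset += rows


# Stand-in for the API: 427 fake records, addressed with a 1-based offset
FAKE_RECORDS = [f"doc-{i}" for i in range(1, 428)]


def fake_fetch(offset, rows):
    """Return up to `rows` records starting at the 1-based `offset`."""
    return FAKE_RECORDS[offset - 1 : offset - 1 + rows]


for offset, page in paginate(fake_fetch):
    print(f"documents {offset} to {offset + len(page) - 1}")
```

The loop stops on the first empty page, so a record count that is an exact multiple of `rows` costs one extra request.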

### Unleashing the scraper

In [10]:
csv_file_path = '../Data/Rijksoverheid/documents_ict_20200101.csv'
fetch_and_save_documents('ict', '20200101', csv_file_path)

Processing documents 1 to 200:  97%|█████████▋| 194/200 [03:02<00:04,  1.46it/s]

An error occurred while extracting text from url https://www.rijksoverheid.nl/documenten/beleidsnotas/2022/11/11/beslisnota-bij-antwoorden-kamervragen-over-toegang-tik-tok-medewerkers-tot-data-europese-gebruikers: Failed to open file '../Data/Rijksoverheid/temp.pdf'.


Processing documents 1 to 200: 100%|██████████| 200/200 [03:06<00:00,  1.07it/s]
Processing documents 201 to 400:   4%|▍         | 9/200 [00:09<02:23,  1.33it/s]

An error occurred while extracting text from url https://www.rijksoverheid.nl/documenten/kamerstukken/2022/11/24/tk-advies-van-het-adviescollege-ict-toetsing-over-het-beslag-informatie-systeem: Failed to open file '../Data/Rijksoverheid/temp.pdf'.


Processing documents 201 to 400: 100%|██████████| 200/200 [03:57<00:00,  1.19s/it]
Processing documents 401 to 427: 100%|██████████| 27/27 [00:16<00:00,  1.66it/s]


All documents processed and saved to CSV file.


In [11]:
# check the first few rows of the dataset
df = pd.read_csv('../Data/Rijksoverheid/documents_ict_20200101.csv')
df.head()

| | id | type | title | canonical | introduction | lastmodified | available | initialdate | content |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 67ffb596-fe03-4c41-88f3-69bc783243f3 | kamerstuk | Kamerbrief over cyberveiligheid in het onderwijs | https://www.rijksoverheid.nl/documenten/kamers... | Minister Van Engelshoven informeert de Tweede ... | 2022-04-12T09:35:57.748Z | 2020-02-14T18:26:00.000Z | 2020-02-14T01:00:00.000+01:00 | 13 Op 23 december 2019 ... |
| 1 | b1caa73e-1fba-43dc-8d8d-5d40c98f0c2e | kamerstuk | Beantwoording Kamervragen over het bericht ‘Ha... | https://www.rijksoverheid.nl/documenten/kamers... | Minister Bruins beantwoordt vragen over het be... | 2022-08-31T09:05:54.038Z | 2020-02-21T15:55:00.000Z | 2020-02-21T00:00:00.000+01:00 | Geachte voorzitter, Hierbij zend ik u, mede... |
| 2 | 0dc59efa-9714-4d42-9fb7-4b6444c2cc70 | kamerstuk | Kamerbrief over doorlichting SSC-ICT | https://www.rijksoverheid.nl/documenten/kamers... | Minister Knops (BZK) stuurt de Tweede Kamer he... | 2022-04-04T08:30:53.086Z | 2020-03-02T12:59:00.000Z | 2020-03-02T00:00:00.000+01:00 | 0018 2500 EA DEN HAAG Datum 2 maart... |
| 3 | 853678f5-b676-4572-b189-3be188961774 | kamerstuk | Kamerbrief over doorontwikkeling Rijks ICT-das... | https://www.rijksoverheid.nl/documenten/kamers... | Minister Knops (BZK) informeert de Tweede Kame... | 2022-08-17T08:31:27.904Z | 2020-03-04T14:02:00.000Z | 2020-03-04T00:00:00.000+01:00 | 18 2500 AE Den Haag Datum 4 maart 2020 Betref... |
| 4 | 00324347-7075-46bb-8b1a-8df05ea83ba4 | kamerstuk | Beantwoording Kamervragen over zorgwekkende si... | https://www.rijksoverheid.nl/documenten/kamers... | Minister Knops beantwoordt vragen over de zorg... | 2022-08-12T08:03:54.964Z | 2020-03-06T15:31:00.000Z | 2020-03-06T00:00:00.000+01:00 | 18 2500 EA Den Haag Datum 6 maart 2020 Betref... |


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 427 entries, 0 to 426
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            427 non-null    object
 1   type          427 non-null    object
 2   title         427 non-null    object
 3   canonical     427 non-null    object
 4   introduction  427 non-null    object
 5   lastmodified  427 non-null    object
 6   available     427 non-null    object
 7   initialdate   427 non-null    object
 8   content       422 non-null    object
dtypes: object(9)
memory usage: 30.2+ KB


In [13]:
# Which rows have NaN values in the content column?
df[df['content'].isna()]

| | id | type | title | canonical | introduction | lastmodified | available | initialdate | content |
|---|---|---|---|---|---|---|---|---|---|
| 167 | 031c9ba6-b36f-444f-a683-9a2d134a8021 | beleidsnota | Beslisnota bij antwoorden op Kamervragen over ... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2022-11-22T14:10:28.258Z | 2022-11-18T09:32:00.000Z | 2022-11-18T10:35:08.494+01:00 | NaN |
| 193 | c9a0943e-a057-4db2-bf52-a2f1ac8969d1 | beleidsnota | Beslisnota bij antwoorden Kamervragen over toe... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2022-11-28T09:19:38.056Z | 2022-11-11T16:55:00.000Z | 2022-11-11T18:12:56.474+01:00 | NaN |
| 205 | 12741548-91f5-4f41-b0cc-d380bf4125b1 | rapport | Convenant waarborging .nl-domein 2022-2029 | https://www.rijksoverheid.nl/documenten/rappor... | Het convenant tussen EZK en de Stichting Inter... | 2022-12-07T14:26:05.870Z | 2022-11-23T10:14:00.000Z | 2022-11-23T11:20:07.248+01:00 | NaN |
| 208 | e8c135f0-bade-4272-bdfe-1cfef4532e89 | kamerstuk | Kamerbrief over advies van het Adviescollege I... | https://www.rijksoverheid.nl/documenten/kamers... | Minister Yeşilgöz-Zegerius (JenV) stuurt de Tw... | 2022-12-07T09:12:45.219Z | 2022-11-24T18:28:00.000Z | 2022-11-24T18:40:07.314+01:00 | NaN |
| 335 | c69b41f5-8fdd-4409-8d00-8b2827a059f8 | beleidsnota | Beslisnota bij Kamerbrief over planning en voo... | https://www.rijksoverheid.nl/documenten/beleid... | In een beslisnota staat achtergrondinformatie ... | 2024-02-14T10:55:13.395Z | 2024-02-14T10:48:57.349Z | 2024-02-14T08:50:04.933+01:00 | NaN |


In [14]:
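# Label 430 does not exist: the DataFrame has 427 rows (index 0 to 426), so this lookup raises a KeyError
df['content'][430]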

KeyError: 430