# Download Data

## Download through API (10 at a time)

#### Notes on the DIP API (from the short documentation)

Each request returns at most 100 entities. The full-text resource types are generally limited to at most 10 entities. In either case, further entities, where available, can always be loaded via the cursor parameter.


The entities are always sorted in descending order by date and ID.
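The cursor mechanism described above can be sketched as a small generator. This is only an illustration: `iterate_pages` and `fake_fetch_page` are hypothetical names, and the stop condition (no cursor, or a cursor that no longer changes) is an assumption that the excerpt above does not state explicitly. The fake fetch function stands in for a real request so the sketch runs offline.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def iterate_pages(
    fetch_page: Callable[[Optional[str]], Tuple[List[str], Optional[str]]],
    max_pages: int = 10_000,
) -> Iterator[List[str]]:
    """Yield pages until the cursor is missing or stops changing."""
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard upper bound as a safety net
        items, next_cursor = fetch_page(cursor)
        if items:
            yield items
        # Assumed end-of-data signal: no cursor, or the same cursor again.
        if not next_cursor or next_cursor == cursor:
            break
        cursor = next_cursor


def fake_fetch_page(cursor: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Stand-in for a real request: two pages of data, then a repeated cursor."""
    pages = {None: (["a", "b"], "c1"), "c1": (["c"], "c2"), "c2": ([], "c2")}
    return pages[cursor]


all_items = [item for page in iterate_pages(fake_fetch_page) for item in page]
print(all_items)
```

Comparing the new cursor with the previous one guards against a loop that never ends when the server keeps returning a non-empty cursor.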

In [1]:
import os
import requests
from xml.etree import ElementTree as ET
import time


In [2]:
def get_api_url(api_key, cursor=None):
    # Note: date_end is read from the notebook's global scope; set it before calling.
    # requests leaves out query parameters whose value is None.
    headers = {"Authorization": f"ApiKey {api_key}"}
    base_url = "https://search.dip.bundestag.de/api/v1/drucksache-text"
    params = {"f.wahlperiode": 19, "format": "xml", "f.datum.end": date_end, "apikey": api_key}
    
    if cursor:
        params["cursor"] = cursor

    response = requests.get(base_url, params=params, headers=headers)
    
    if response.status_code == 200:
        # Return the fully built URL and the cookies for the follow-up download.
        api_url = response.url
        return api_url, response.cookies
    else:
        print(f"Error during the challenge request: {response.status_code}")
        return None, None
    
def download_data(api_url, cookies):
    try:
        response = requests.get(api_url, cookies=cookies)
        response.raise_for_status()
        
        # Drop undecodable bytes, then parse the XML payload.
        data = ET.fromstring(response.content.decode('utf-8', errors='ignore'))
        return data
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
    except ET.ParseError as e:
        print(f"Error parsing the response: {e}")
        return None

def save_data_as_file(data, folder_path):
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)

    id_nummer = data.find(".//id").text
    filename = f"{id_nummer}.xml"
   
    
    file_path = os.path.join(folder_path, filename)
    
    with open(file_path, "wb") as file:
        file.write(ET.tostring(data))
    print(f"Data saved to {file_path}")

In [3]:
api_key = "rgsaY4U.oZRQKUHdJhF9qguHMkwCGIoLaqEcaHjYLF"

In [4]:
folder_path = "./xml_chunk/"
cursor = None
date_end = None

while True:
    api_url, cookies = get_api_url(api_key, cursor)
    
    if api_url is not None:
        data = download_data(api_url, cookies)
        
        if data is not None:
            save_data_as_file(data, folder_path)
            cursor = data.find(".//cursor").text
            if not cursor:
                break  
        else:
            break  
    else:
        break  
    time.sleep(1)

Data saved to ./xml_files_new_test/258601.xml
Data saved to ./xml_files_new_test/258190.xml
Data saved to ./xml_files_new_test/258281.xml
Data saved to ./xml_files_new_test/258280.xml
Data saved to ./xml_files_new_test/258502.xml
Data saved to ./xml_files_new_test/258143.xml
Data saved to ./xml_files_new_test/258134.xml
Data saved to ./xml_files_new_test/258122.xml
Data saved to ./xml_files_new_test/258112.xml
Data saved to ./xml_files_new_test/258102.xml
Data saved to ./xml_files_new_test/258092.xml
Data saved to ./xml_files_new_test/258086.xml
Data saved to ./xml_files_new_test/258070.xml
Data saved to ./xml_files_new_test/258052.xml
Data saved to ./xml_files_new_test/258039.xml
Data saved to ./xml_files_new_test/258047.xml
Data saved to ./xml_files_new_test/258016.xml
Data saved to ./xml_files_new_test/258006.xml
Data saved to ./xml_files_new_test/257996.xml
Data saved to ./xml_files_new_test/257988.xml
Data saved to ./xml_files_new_test/257973.xml
Data saved to ./xml_files_new_test

KeyboardInterrupt: 

## !! ERROR !!

The download stopped at date = 2021-04-29 and id = 253362.

Missing data: between date = 2021-04-29 (id = 253362) and date = 2021-04-27, so the download is restarted with the restriction "f.datum.end": "2021-04-27".
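A restart with a date restriction only changes the query parameters. The following is a minimal sketch of building them explicitly instead of relying on a global variable; `build_params` is a hypothetical helper, and the parameter names are the ones already used in this notebook.

```python
from typing import Dict, Optional, Union


def build_params(
    wahlperiode: int = 19,
    date_end: Optional[str] = None,
    cursor: Optional[str] = None,
) -> Dict[str, Union[str, int]]:
    """Assemble query parameters, leaving out filters that are not set."""
    params: Dict[str, Union[str, int]] = {"f.wahlperiode": wahlperiode, "format": "xml"}
    if date_end:
        params["f.datum.end"] = date_end  # upper bound on the document date
    if cursor:
        params["cursor"] = cursor
    return params


restart_params = build_params(date_end="2021-04-27")
print(restart_params)
```

Passing the date as an argument makes each restart self-describing, so a cell cannot silently pick up a stale `date_end` from an earlier run.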

In [5]:
folder_path = "./xml_chunk/"
cursor = None
date_end = "2021-04-27"
while True:
    api_url, cookies = get_api_url(api_key, cursor)
    
    if api_url is not None:
        data = download_data(api_url, cookies)
        
        if data is not None:
            save_data_as_file(data, folder_path)
            cursor = data.find(".//cursor").text
            if not cursor:
                break 
        else:
            break  
    else:
        break  
    time.sleep(1)

Data saved to ./xml_files_new_test/253710.xml
Data saved to ./xml_files_new_test/253318.xml
Data saved to ./xml_files_new_test/253308.xml
Data saved to ./xml_files_new_test/253297.xml
Data saved to ./xml_files_new_test/253287.xml
Data saved to ./xml_files_new_test/253276.xml
Data saved to ./xml_files_new_test/253264.xml
Data saved to ./xml_files_new_test/253241.xml
Data saved to ./xml_files_new_test/253328.xml
Data saved to ./xml_files_new_test/253257.xml
Data saved to ./xml_files_new_test/253218.xml
Data saved to ./xml_files_new_test/253208.xml
Data saved to ./xml_files_new_test/253198.xml
Data saved to ./xml_files_new_test/253182.xml
Data saved to ./xml_files_new_test/253100.xml
Data saved to ./xml_files_new_test/253232.xml
Data saved to ./xml_files_new_test/253169.xml
Data saved to ./xml_files_new_test/253159.xml
Data saved to ./xml_files_new_test/253148.xml
Data saved to ./xml_files_new_test/253138.xml
Data saved to ./xml_files_new_test/253126.xml
Data saved to ./xml_files_new_test

KeyboardInterrupt: 

## !! ERROR !!

Timeout around date = 2018-04-13, so the download is restarted from that date.
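Timeouts like this one could also be absorbed by retrying the failed request a few times before giving up. The sketch below is one possible approach, not what this notebook actually did: `with_retries` and `flaky_download` are hypothetical names, and the delays are arbitrary. The demonstration replaces the real pause with a list so it runs instantly.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_retries(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("attempts must be at least 1")


# Hypothetical flaky call: fails twice with a timeout, then succeeds.
calls = {"count": 0}


def flaky_download() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "<response/>"


waits: List[float] = []
result = with_retries(flaky_download, sleep=waits.append)
print(result, waits)
```

With real requests it would be more precise to catch only `requests.exceptions.RequestException` rather than every exception.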

In [6]:
folder_path = "./xml_chunk/"
cursor = None
date_end = "2018-04-13"
while True:
    api_url, cookies = get_api_url(api_key, cursor)
    
    if api_url is not None:
        data = download_data(api_url, cookies)
        
        if data is not None:
            save_data_as_file(data, folder_path)
            cursor = data.find(".//cursor").text
            if not cursor:
                break 
        else:
            break  
    else:
        break  
    time.sleep(1)

Data saved to ./xml_files_new_test/218726.xml
Data saved to ./xml_files_new_test/218661.xml
Data saved to ./xml_files_new_test/218651.xml
Data saved to ./xml_files_new_test/218638.xml
Data saved to ./xml_files_new_test/218629.xml
Data saved to ./xml_files_new_test/218621.xml
Data saved to ./xml_files_new_test/218611.xml
Data saved to ./xml_files_new_test/218598.xml
Data saved to ./xml_files_new_test/218585.xml
Data saved to ./xml_files_new_test/218575.xml
Data saved to ./xml_files_new_test/218605.xml
Data saved to ./xml_files_new_test/218558.xml
Data saved to ./xml_files_new_test/218545.xml
Data saved to ./xml_files_new_test/218534.xml
Data saved to ./xml_files_new_test/218522.xml
Data saved to ./xml_files_new_test/218552.xml
Data saved to ./xml_files_new_test/218503.xml
Data saved to ./xml_files_new_test/218493.xml
Data saved to ./xml_files_new_test/218476.xml
Data saved to ./xml_files_new_test/218446.xml
Data saved to ./xml_files_new_test/218436.xml
Data saved to ./xml_files_new_test

KeyboardInterrupt: 

### Separate files

In [24]:
def split_and_save_all(input_folder, output_folder):
    os.makedirs(output_folder, exist_ok=True)

    for filename in os.listdir(input_folder):
        if filename.endswith('.xml'):
            input_xml = os.path.join(input_folder, filename)

            try:
                tree = ET.parse(input_xml)
                root = tree.getroot()

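                for document in root.findall('.//document'):
                    # Slashes in the document number are not usable in file names.
                    doc_id = document.find(".//dokumentnummer").text.replace("/", "_")
                    herausgeber = document.find(".//herausgeber").text

                    # Mark Bundesrat (BR) documents with a prefix.
                    if herausgeber == 'BR':
                        doc_id = f'BR_{doc_id}'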

                    doc_tree = ET.ElementTree(document)

                    output_file = os.path.join(output_folder, f'{doc_id}.xml')
                    doc_tree.write(output_file)

                    #print(f'Document {doc_id} from {filename} saved to {output_file}')
            except ET.ParseError as e:
                print(f"Error parsing {filename}: {e}")


In [29]:
input_folder = './xml_chunk'
output_folder = './xml_single'

split_and_save_all(input_folder, output_folder)

Document 19_25639 from 19_25639.xml saved to ./xml_new/19_25639.xml
Document 19_25637 from 19_25639.xml saved to ./xml_new/19_25637.xml
Document 19_25627 from 19_25639.xml saved to ./xml_new/19_25627.xml
Document 19_25626 from 19_25639.xml saved to ./xml_new/19_25626.xml
Document 19_26195 from 19_25639.xml saved to ./xml_new/19_26195.xml
Document BR_zu497_20(B) from 19_25639.xml saved to ./xml_new/BR_zu497_20(B).xml
Document 19_25621 from 19_25639.xml saved to ./xml_new/19_25621.xml
Document 19_25618 from 19_25639.xml saved to ./xml_new/19_25618.xml
Document 19_25607 from 19_25639.xml saved to ./xml_new/19_25607.xml
Document 19_25605 from 19_25639.xml saved to ./xml_new/19_25605.xml
Document 19_924 from 19_924.xml saved to ./xml_new/19_924.xml
Document 19_911 from 19_924.xml saved to ./xml_new/19_911.xml
Document 19_909 from 19_924.xml saved to ./xml_new/19_909.xml
Document 19_908 from 19_924.xml saved to ./xml_new/19_908.xml
Document 19_907 from 19_924.xml saved to ./xml_new/19_907.xm

KeyboardInterrupt: 

## Check for missing files

In [30]:
def find_missing_files(directory, start_range, end_range):
    existing_files = set()

    for filename in os.listdir(directory):
        if filename.endswith(".xml") and filename.startswith("19_"):
            try:
                file_number = int(filename.split('_')[1].split('.')[0])
                existing_files.add(file_number)
            except ValueError:
                pass

    missing_files = set(range(start_range, end_range + 1)) - existing_files

    if missing_files:
        print(f"Missing files in {directory}: {missing_files}")
    else:
        print(f"No missing files in {directory}.")

In [31]:
start_range = 1
end_range = 32713
directory_path = "./xml_single"

find_missing_files(directory_path, start_range, end_range)

Missing files in ./xml_new: {29085, 29087, 29088, 29089, 29091, 29092, 29094, 29095, 29096, 29097, 29098, 29099, 29100, 29101, 29102, 29103, 29104, 29105, 29106, 29107, 29108, 29130, 29131, 29132, 29133, 29134, 29135, 29136, 29137, 29170, 29171, 29682}


## Download missing data

Download data for the missing files.
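Instead of typing the list of document numbers by hand, it could be derived from the set of missing numbers. This is a small sketch that assumes the missing numbers are available as a Python set (the `find_missing_files` function above only prints them); `to_document_numbers` is a hypothetical helper.

```python
from typing import Iterable, List


def to_document_numbers(missing: Iterable[int], wahlperiode: int = 19) -> List[str]:
    """Turn numbers such as 29085 into document numbers such as '19/29085'."""
    return [f"{wahlperiode}/{number}" for number in sorted(missing)]


document_numbers_example = to_document_numbers({29087, 29085, 29682})
print(document_numbers_example)
```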

In [32]:
document_numbers = ["19/29085", "19/29087", "19/29088", "19/29089", "19/29091", "19/29092", "19/29094", "19/29095",
                    "19/29096", "19/29097", "19/29098", "19/29099", "19/29100", "19/29101", "19/29102", "19/29103",
                    "19/29104", "19/29105", "19/29106", "19/29107", "19/29108", "19/29130", "19/29131", "19/29132",
                    "19/29133", "19/29134", "19/29135", "19/29136", "19/29137", "19/29170", "19/29171", "19/29682"]
base_url = "https://search.dip.bundestag.de/api/v1/drucksache-text"
format_type = "xml"
output_folder = "./xml_missing"

os.makedirs(output_folder, exist_ok=True)

for document_number in document_numbers:
    url = f"{base_url}?format={format_type}&f.dokumentnummer={document_number}&apikey={api_key}"

    try:
        response = requests.get(url)
        response.raise_for_status() 

        # Write the raw bytes so the file keeps the encoding sent by the server.
        file_name = f"{document_number.replace('/', '_')}.xml"
        with open(os.path.join(output_folder, file_name), "wb") as file:
            file.write(response.content)

    except requests.exceptions.RequestException as e:
        print(f"Error downloading data for document number {document_number}: {e}")


## Split missing data

Here the split step only strips the \<response> wrapper.  
No actual splitting is needed, because each document was downloaded individually.

On the first run there was an error in file 19_29682: its text contained several occurrences of "&#xffff", which could not be processed. After deleting them manually, everything works fine.
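The manual clean-up could be automated by removing numeric character references that point to code points outside the XML 1.0 character range before parsing. The sketch below is one way to do that, not the procedure used here; `strip_invalid_char_refs` is a hypothetical helper, and it assumes the references appear in the complete form `&#xffff;`.

```python
import re
from xml.etree import ElementTree as ET

# Matches numeric character references such as &#xffff; or &#65;
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def _is_valid_xml_char(code_point: int) -> bool:
    """Check a code point against the character range allowed by XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def strip_invalid_char_refs(xml_text: str) -> str:
    """Remove character references that an XML 1.0 parser would reject."""

    def replace(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if _is_valid_xml_char(code_point) else ""

    return _CHAR_REF.sub(replace, xml_text)


raw = "<text>before&#xffff;after &#65;</text>"
cleaned = strip_invalid_char_refs(raw)
parsed = ET.fromstring(cleaned)
print(cleaned, parsed.text)
```

Valid references such as `&#65;` are left untouched, so only the problematic ones disappear from the text.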

In [37]:
input_folder = './xml_missing'
output_folder = './xml_single'

split_and_save_all(input_folder, output_folder)

Document 19_29171 from 19_29171.xml saved to ./xml_new/19_29171.xml
Document 19_29170 from 19_29170.xml saved to ./xml_new/19_29170.xml
Document 19_29100 from 19_29100.xml saved to ./xml_new/19_29100.xml
Document 19_29101 from 19_29101.xml saved to ./xml_new/19_29101.xml
Document 19_29088 from 19_29088.xml saved to ./xml_new/19_29088.xml
Document 19_29103 from 19_29103.xml saved to ./xml_new/19_29103.xml
Document 19_29102 from 19_29102.xml saved to ./xml_new/19_29102.xml
Document 19_29089 from 19_29089.xml saved to ./xml_new/19_29089.xml
Document 19_29099 from 19_29099.xml saved to ./xml_new/19_29099.xml
Document 19_29106 from 19_29106.xml saved to ./xml_new/19_29106.xml
Document 19_29107 from 19_29107.xml saved to ./xml_new/19_29107.xml
Document 19_29098 from 19_29098.xml saved to ./xml_new/19_29098.xml
Document 19_29105 from 19_29105.xml saved to ./xml_new/19_29105.xml
Document 19_29104 from 19_29104.xml saved to ./xml_new/19_29104.xml
Document 19_29096 from 19_29096.xml saved to ./x

## Recheck for missing files

In [38]:
start_range = 1
end_range = 32713
directory_path = "./xml_single"

find_missing_files(directory_path, start_range, end_range)

No missing files in ./xml_new.
