# Scraping the Official German Government Website
- Germany is divided into federal states, each containing districts (`Kreise`), and within each district is a local court (`Amtsgericht`) where foreclosure auctions (`Zwangsversteigerungen`) are conducted.
- Each local court uploads foreclosure listings to the central portal [ZVG Portal](https://www.zvg-portal.de/index.php?button=Suchen&all=1), which provides an overview of the auctions.
- The data includes free-text descriptions as well as PDFs such as exposés, appraisals (`Gutachten`), and other documents.
- We will evaluate the value these PDFs provide, focusing specifically on foreclosures in Berlin to see if including this information improves our model predictions. This includes exploring approaches like image-based regression or converting images into text.
- Although other websites exist (e.g., [Versteigerungspool](https://versteigerungspool.de/), [Zwangsversteigerung.de](https://www.zwangsversteigerung.de/) or [ZVG-Online](https://www.zvg-online.net/)), they typically scrape the government site and offer a more stylized display with less information. We will not scrape these sites, as they add no value beyond easier access to some parts of the data.

In [None]:
!python database_helpers.py

In [2]:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
import locale
import logging
import os
from code.database_helpers import ForeclosureCaseSchema, ForeclosureCaseModel, engine, session

locale.setlocale(locale.LC_TIME, 'de_DE')
# Create the same directory the log file below is written to.
os.makedirs("../logs", exist_ok=True)
logging.basicConfig(
    filename=f"../logs/foreclosure_scraper_{int(datetime.now().timestamp())}.log",
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

### Federal State Codes
- We first retrieve the codes for each federal state. 
- While these are typically constants, we handle this dynamically in case any codes change, though such changes are rare.
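
Because the codes rarely change, a static fallback can guard against a failed lookup. The following is a minimal sketch, assuming the sixteen codes that appear in this notebook's run output are still current. `KNOWN_LAND_CODES` and `resolve_land_codes` are hypothetical names and are not part of the scraper.

```python
# Snapshot of the codes seen in this notebook's run output (an assumption,
# not an authoritative list; the portal remains the source of truth).
KNOWN_LAND_CODES = [
    "bw", "by", "be", "br", "hb", "hh", "he", "mv",
    "ni", "nw", "rp", "sl", "sn", "st", "sh", "th",
]


def resolve_land_codes(fetched):
    """Return the scraped codes, or a copy of the snapshot if none were scraped."""
    cleaned = []
    for code in fetched or []:
        code = (code or "").strip()
        # "0" is the placeholder option of the select field, not a state.
        if code and code != "0":
            cleaned.append(code)
    return cleaned or list(KNOWN_LAND_CODES)
```

With this in place, a scrape that returns nothing would fall back to the snapshot, while any successfully scraped list would take precedence.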

In [3]:
url = "https://www.zvg-portal.de/index.php?button=Termine%20suchen"
response = requests.get(url)
search_soup = BeautifulSoup(response.text, 'html.parser')

land_select = None
for tr in search_soup.find_all('tr'):
    if "Land" in tr.text:
        land_select = tr.find_next('select')
        break

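if land_select is None:
    # Raise instead of calling exit(), which would shut down the notebook kernel.
    raise RuntimeError("Land select field not found.")

# Skip the placeholder option ("0") and any option without a value attribute.
land_codes = [option['value'] for option in land_select.find_all('option') if option.get('value') not in (None, "0")]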

### Foreclosure Auctions
- We first retrieve the main site where all foreclosure listings are available. From there, links can be followed to the specific pages for each foreclosure.
- On the foreclosure-specific page, certain details like the auction date (`Termin`) are consistently formatted.
- However, many other fields, such as the property value (`Verkehrswert in €`), vary in format. A single field may list several objects (for example, separate values for land and house), contain a full free-text description, or hold a plain integer.
- Documents like the official announcement (`amtliche Bekanntmachung`), exposé, appraisal (`Gutachten`) and photos are provided in PDF format. We'll attempt to extract relevant information from these documents in the information extraction and cleaning notebook.
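
To illustrate why the `Verkehrswert in €` field needs cleaning, here is a minimal sketch that pulls every German-formatted amount out of a free-text value. The helper name `extract_euro_amounts` and the sample strings are hypothetical; the real field may contain formats this pattern does not cover.

```python
import re

# German notation: "." groups thousands, "," marks decimals.
# First alternative: dot-grouped amounts such as "250.000,00".
# Second alternative: plain digits such as "85000", optionally with decimals.
_AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})+(?:,\d{1,2})?|\d+(?:,\d{1,2})?")


def extract_euro_amounts(raw: str) -> list[float]:
    """Return every number found in a free-text value, in order of appearance."""
    amounts = []
    for match in _AMOUNT.finditer(raw):
        token = match.group(0)
        # Drop thousands separators, then switch the decimal comma to a point.
        amounts.append(float(token.replace(".", "").replace(",", ".")))
    return amounts


# Hypothetical inputs covering the three cases described above.
single = extract_euro_amounts("250.000,00 €")
multiple = extract_euro_amounts("a) 120.000,00 € b) 35.500,00 €")
plain = extract_euro_amounts("85000")
```

Note that the pattern returns any number in the text, so identifiers such as parcel numbers embedded in a description would also be picked up and need filtering in a later step.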

In [3]:
def get_foreclosure_case_data(link: str) -> ForeclosureCaseModel:
    response = requests.get(link, headers={"Referer": "https://www.zvg-portal.de/index.php?button=Suchen"})
    if not str(response.status_code).startswith("2"):
        raise Exception(f"Request for {link} failed!")
    
    soup = BeautifulSoup(response.text, 'html.parser')
    
    details_table = soup.find_all('table')[0]
    rows = details_table.find_all('tr')
    aktenzeichen = rows[0].find_all('td')[0].text.strip()
    letzte_aktualisierung = rows[0].find_all('td')[1].text.strip()
    
    foreclosure_case = {
        "link": link,
        "aktenzeichen": aktenzeichen,
        "letzte_aktualisierung": letzte_aktualisierung
    }
    
    for row in rows[1:]:
        key = row.find_all('td')[0].text.strip()[:-1].lower().replace(" ", "_").replace("/", "_")
        if key in ForeclosureCaseModel.model_fields or key == "verkehrswert_in_€":
            second_cell = row.find_all('td')[1]
            anchor_tags = second_cell.find_all('a')
            img_tags = second_cell.find_all('img')
            if len(anchor_tags) == 1 and len(img_tags) == 1:
                if key == "foto":
                    foreclosure_case.setdefault('foto', []).append(anchor_tags[0]['href'].strip())
                    continue
                value = anchor_tags[0]['href'].strip()
            else:
                value = second_cell.text.strip()

            foreclosure_case[key] = value
    
    return ForeclosureCaseModel(**foreclosure_case)

In [4]:
for land_code in land_codes:
    post_url = "https://www.zvg-portal.de/index.php?button=Suchen&all=1"
    post_data = {
        # requests URL-encodes form data itself, so pass literal spaces rather than "+".
        'ger_name': '-- Alle Amtsgerichte --',
        'order_by': '2',
        'land_abk': land_code,
        'ger_id': '0'
    }
    
    post_response = requests.post(post_url, data=post_data)
    if not str(post_response.status_code).startswith("2"):
        logging.error(f"Request for {land_code} failed! Status Code: {post_response.status_code}")
        continue
    
    print(f"Working on {land_code}:")
    land_soup = BeautifulSoup(post_response.text, 'html.parser')
    result_tables = land_soup.find_all('table')
    if not result_tables or len(result_tables) < 2:
        logging.info(f"No foreclosure cases for {land_code}!")
        continue
    
    result_table = result_tables[1]
    new_forclosures = 0
    for tr in result_table.find_all('tr'):
        tds = tr.find_all('td')
        if len(tds) == 3 and tds[0].text.strip() == "Aktenzeichen":
            if not tds[1].find('a'):
                logging.info(f"Foreclosure case {tds[1].text.strip()} already expired!")
                continue
            
            link = tds[1].find('a')['href']
            case_url = f"https://www.zvg-portal.de/{link}"
            
            if session.query(ForeclosureCaseSchema).filter_by(link=case_url).first():
                logging.warning(f"Foreclosure Case {case_url} already exists in the database!")
                continue
            try:
                foreclosure_case = get_foreclosure_case_data(case_url)
            except Exception as e:
                logging.error(f"Error fetching foreclosure case {case_url}: {e}")
                continue
            
            # Convert the validated model into a database row and tag it with its state code.
            foreclosure_case_db = ForeclosureCaseSchema(**foreclosure_case.model_dump())
            foreclosure_case_db.bundesland_code = land_code
            session.add(foreclosure_case_db)
            new_forclosures += 1

    print(f"Added {new_forclosures} new forclosures for {land_code}.")
    
session.commit()

Working on bw:
Added 0 new forclosures for bw.
Working on by:
Added 0 new forclosures for by.
Working on be:
Added 0 new forclosures for be.
Working on br:
Added 0 new forclosures for br.
Working on hb:
Added 0 new forclosures for hb.
Working on hh:
Working on he:
Added 0 new forclosures for he.
Working on mv:
Working on ni:
Added 0 new forclosures for ni.
Working on nw:
Added 0 new forclosures for nw.
Working on rp:
Added 0 new forclosures for rp.
Working on sl:
Added 0 new forclosures for sl.
Working on sn:
Added 0 new forclosures for sn.
Working on st:
Added 0 new forclosures for st.
Working on sh:
Working on th:
Added 0 new forclosures for th.


In [5]:
session.close()
engine.dispose()
for handler in logging.getLogger().handlers:
    handler.flush()
    handler.close()